
Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation

Wen Huang1,2, Yanmei Gu3, Zhiming Wang3, Huijia Zhu3, Yanmin Qian1†
1Auditory Cognition and Computational Acoustics Lab, MoE Key Lab of Artificial Intelligence, AI Institute, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
2SJTU Paris Elite Institute of Technology, 3Ant Group, Shanghai, China
†Corresponding author
Abstract

Advances in speech synthesis technologies, like text-to-speech (TTS) and voice conversion (VC), have made detecting deepfake speech increasingly challenging. Spoofing countermeasures often struggle to generalize effectively, particularly when faced with unseen attacks. To address this, we propose a novel strategy that integrates Latent Space Refinement (LSR) and Latent Space Augmentation (LSA) to improve the generalization of deepfake detection systems. LSR introduces multiple learnable prototypes for the spoof class, refining the latent space to better capture the intricate variations within spoofed data. LSA further diversifies spoofed data representations by applying augmentation techniques directly in the latent space, enabling the model to learn a broader range of spoofing patterns. We evaluated our approach on four representative datasets, i.e. ASVspoof 2019 LA, ASVspoof 2021 LA and DF, and In-The-Wild. The results show that LSR and LSA perform well individually, and their integration achieves competitive results, matching or surpassing current state-of-the-art methods.

Index Terms:
audio deepfake detection, anti-spoofing, generalization

I Introduction

With advancements in speech synthesis systems such as text-to-speech (TTS) and voice conversion (VC), detecting deepfake speech has become increasingly challenging. Synthesized data can originate from a wide range of synthesis systems, each with its own distinct characteristics, making it difficult for spoofing countermeasures to generalize effectively. This challenge is exacerbated when detectors encounter unseen deepfake attacks, often leading to significant performance degradation [1, 2].

To enhance generalization in deepfake detectors, one key direction focuses on developing more robust classification models through improved architecture and learning strategies. Recent studies have utilized features extracted from self-supervised speech models such as Wav2vec [3], Whisper [4], and WavLM [5] as front-end inputs for deepfake detection. These models, trained on large-scale and diverse speech data, strengthen the detection process by providing reliable and domain-agnostic features [6]. Beyond improving feature extraction, researchers have also worked to improve the accuracy of back-end classifiers. Traditional binary classification methods often struggle with generalization, particularly when facing distribution mismatches. To address this, one-class learning approaches have been explored, focusing on creating a compact representation of bonafide speech while effectively pushing away spoofed speech, leading to a well-separated and more generalizable feature space [7, 8].

Another promising direction is through data augmentation, which enhances the robustness of the model by exposing it to a wider range of data variations during training. Traditional techniques such as speed perturbation, SpecAugment [9], and codec augmentation have been shown to improve performance. More recent methods, such as Rawboost [10], use signal processing techniques to boost or distort raw audio, leading to significant improvements. There are also augmentation strategies specifically designed for audio deepfake detection. For instance, CpAug [11] employs a copy-paste strategy to generate diverse training samples, while targeted augmentation methods [12] create pseudo-fakes that challenge the decision boundary, thereby increasing the diversity of fake samples. Furthermore, research has shown that using neural vocoders to augment data can further enhance detection performance [13, 14].

Figure 1: The pipeline of the proposed method, illustrating the process of Latent Space Refinement (LSR) and Latent Space Augmentation (LSA).

Building on these two key directions, we propose a novel strategy of integrating latent space refinement and augmentation to further boost the generalization ability of deepfake detection, as shown in Fig. 1. First, to address the limitations of binary classification in capturing the diverse nature of spoofed audio, we introduce Latent Space Refinement (LSR). In binary classification, models typically assign a single prototype to each class, which oversimplifies the complex variability within spoofed audio. While one-class learning tries to address this by compactly representing the bonafide class and treating others as outliers, it often imposes a rigid boundary that fails to capture the diversity in spoofed data. In contrast, our LSR approach introduces multiple learnable prototypes specifically for the spoof class, refining the latent space to better model the intricate variations within spoofed data. This enhanced representation reduces intra-class variability and allows the model to generalize more effectively across different spoofing attacks.

Second, to further enhance generalization, we apply Latent Space Augmentation (LSA) to diversify spoofed data representations, inspired by successful applications in computer vision [15, 16]. Unlike traditional data augmentation, which focuses on manipulating input data, LSA directly targets the latent space, allowing it to be independent of specific audio-level operations. By applying techniques such as additive noise, affine transformation, batch mixup, and linear interpolation and extrapolation, LSA generates a wide range of spoofed examples that expand the latent space. This expansion helps the model capture more diverse patterns within spoofed data, thereby improving its ability to generalize across different spoofing attacks and enhancing overall detection performance.

Our experimental results confirm the effectiveness of the proposed latent space refinement and augmentation. We evaluated the approach on four representative datasets: ASVspoof 2019 LA [17], ASVspoof 2021 LA and DF [1], and In-The-Wild [2]. The findings show that both LSR and LSA individually contribute to performance improvements, with the integrated system achieving competitive results, matching or surpassing the current state-of-the-art across these diverse benchmarks.

II Methods

II-A Latent Space Refinement

To capture the inherent variations within the spoof class, we introduce multiple learnable prototypes that refine the latent distribution. Assume there are $K$ prototypes for each class, denoted as $\{c_{1},\dots,c_{K}\}$. For the bonafide class, $K=1$, while for the spoof class, $K$ is a hyperparameter chosen based on the complexity of the data. To determine the probability of a sample $x$ belonging to a particular class, we compute the maximum cosine similarity between its embedding $z$ and each of the class prototypes:

$$\cos\theta=\sum_{i=1}^{K}\frac{e^{\langle c_{i},z\rangle\cdot\gamma}}{\sum_{j=1}^{K}e^{\langle c_{j},z\rangle\cdot\gamma}}\langle c_{i},z\rangle \qquad (1)$$

where $\langle x,y\rangle=\frac{x\cdot y}{\|x\|\,\|y\|}$ denotes the cosine similarity between two vectors, and $\gamma$ is a scaling factor set to 10. We smooth the maximum operator with a softmax-like weighting so that the similarity is not overly sensitive to which prototype happens to be closest.
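The smoothed-maximum similarity in Eq. (1) can be written compactly in code. Below is a minimal PyTorch sketch; the function name, shapes, and the `gamma` default are illustrative and not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def smoothed_max_cos(z: torch.Tensor, protos: torch.Tensor, gamma: float = 10.0) -> torch.Tensor:
    """Softmax-smoothed maximum cosine similarity between one embedding z (D,)
    and the prototypes of one class, protos (K, D)."""
    sims = F.cosine_similarity(protos, z.unsqueeze(0), dim=-1)  # (K,): <c_i, z>
    weights = F.softmax(gamma * sims, dim=0)                    # softmax over prototypes
    return (weights * sims).sum()                               # weighted average ~ smooth max
```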

To guide the learning of these prototypes, we design a prototype-based classification loss, inspired by the additive angular margin loss [18]:

$$\mathcal{L}_{proto}(z)=-\log\frac{e^{s(\cos(\theta_{y}+m))}}{e^{s(\cos(\theta_{y}+m))}+e^{s(\cos\theta_{1-y})}} \qquad (2)$$

Here, $y\in\{0,1\}$ is the label of sample $x$, $\theta_{y}$ and $\theta_{1-y}$ are the (smoothed) angles to the prototypes of the target and the opposite class, $m$ is an angular margin penalty, and $s$ is a scaling factor. This loss encourages the model to pull the embeddings of genuine samples closer to the bonafide prototype and spoofed samples closer to their corresponding spoof prototypes.
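As a rough illustration of Eq. (2), the sketch below assumes the smoothed similarities to the target and opposite class have already been computed (e.g., with the helper above); the defaults mirror the values reported in Sec. III-A, while the function name and angle conversion follow standard additive angular margin practice rather than the authors' code:

```python
import torch

def proto_loss(cos_target: torch.Tensor, cos_other: torch.Tensor,
               s: float = 32.0, m: float = 0.2) -> torch.Tensor:
    """Prototype classification loss of Eq. (2) for one sample.
    cos_target / cos_other: smoothed-max similarity to the true / opposite class."""
    theta_y = torch.acos(cos_target.clamp(-1 + 1e-7, 1 - 1e-7))  # angle to the true class
    logit_y = s * torch.cos(theta_y + m)                         # margin-penalized logit
    logit_other = s * cos_other
    return -torch.log(torch.exp(logit_y) / (torch.exp(logit_y) + torch.exp(logit_other)))
```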

While the prototypes are learned during training, they risk collapsing to a single center. To mitigate this, we introduce an intra-class regularization for the spoof prototypes $\{c^{s}\}$:

$$\mathcal{L}_{intra}(\{c^{s}\})=\frac{2}{K(K-1)}\sum_{i=1}^{K-1}\sum_{j=i+1}^{K}\langle c^{s}_{i},c^{s}_{j}\rangle \qquad (3)$$

This regularization term calculates the mean similarity between the spoof prototypes, encouraging them to spread out in the latent space, thereby preventing prototype collapse.
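A possible PyTorch rendering of Eq. (3), assuming `spoof_protos` holds the K spoof prototypes as rows; the helper name is ours:

```python
import torch
import torch.nn.functional as F

def intra_reg(spoof_protos: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity among the K spoof prototypes (Eq. 3)."""
    c = F.normalize(spoof_protos, dim=-1)            # prototypes on the unit hypersphere
    sims = c @ c.t()                                 # (K, K) cosine similarities
    k = c.size(0)
    iu = torch.triu_indices(k, k, offset=1)          # all pairs i < j
    return sims[iu[0], iu[1]].mean()
```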

To further enhance the distinction between spoof and bonafide prototypes, we introduce an inter-class regularization term. This term computes the smoothed maximum cosine similarity between the spoof prototypes $\{c^{s}\}$ and the single bonafide prototype $c^{b}$:

$$\mathcal{L}_{inter}(\{c^{s}\},c^{b})=\delta+\sum_{i=1}^{K}\frac{e^{\langle c^{s}_{i},c^{b}\rangle\cdot\gamma}}{\sum_{j=1}^{K}e^{\langle c^{s}_{j},c^{b}\rangle\cdot\gamma}}\langle c^{s}_{i},c^{b}\rangle \qquad (4)$$

where $\delta$ is a regularization coefficient that prevents the loss from becoming negative.
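Eq. (4) can be sketched analogously; again the function name, shapes, and the `delta` default (0.2, as in Sec. III-A) are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def inter_reg(spoof_protos: torch.Tensor, bona_proto: torch.Tensor,
              gamma: float = 10.0, delta: float = 0.2) -> torch.Tensor:
    """Smoothed max similarity between spoof prototypes and the bonafide prototype (Eq. 4)."""
    sims = F.cosine_similarity(spoof_protos, bona_proto.unsqueeze(0), dim=-1)  # (K,)
    weights = F.softmax(gamma * sims, dim=0)
    return delta + (weights * sims).sum()
```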

Hence, the overall objective function for LSR is defined as follows:

$$\mathcal{L}_{LSR}=\mathcal{L}_{proto}+\mathcal{L}_{intra}+\mathcal{L}_{inter} \qquad (5)$$

In addition, the LSR loss can be incorporated alongside a binary classification loss, such as Weighted Cross Entropy (WCE), to refine the latent distribution and reduce intra-class variance.

II-B Latent Space Augmentation

While multi-prototypical refinement enhances the representation of the spoofed class, further generalization can be achieved by augmenting the diversity of the training data. Instead of solely augmenting raw input data, we apply augmentation directly in the latent space, where lower dimensionality allows for more targeted variations. By focusing these augmentations on spoofed latent features, we generate diverse spoofing variations. Notably, these augmentations are not applied to bonafide latent features, preserving their authenticity.

Given a batch of embeddings $z$, we denote the spoof embeddings in this batch as $z^{s}$ and the bonafide embeddings as $z^{b}$. To create diverse variations of spoof embeddings, we design five latent augmentation patterns for $z^{s}$:

Additive Noise (AN). A simple yet efficient idea is to add a random perturbation to the latent features. Here we apply additive noise drawn from a Gaussian distribution:

$$\hat{z^{s}}=z^{s}+\beta\cdot X,\quad X\sim\mathcal{N}(0,\mathbf{I}) \qquad (6)$$

where $\mathcal{N}(0,\mathbf{I})$ is the standard normal distribution, $\mathbf{I}$ is the identity matrix, and $\beta$ is a scaling factor sampled from $\mathcal{N}(0,1)$.
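A minimal sketch of this additive-noise pattern, assuming `z_s` is the (N, D) matrix of spoof embeddings in the batch; naming is ours:

```python
import torch

def lsa_additive_noise(z_s: torch.Tensor) -> torch.Tensor:
    """Eq. (6): add Gaussian noise, scaled by beta ~ N(0, 1), to spoof embeddings."""
    beta = torch.randn(1, device=z_s.device)
    return z_s + beta * torch.randn_like(z_s)
```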

Affine Transformation (AT). This common transformation for 1D vectors involves scaling and translating the latent features:

$$\hat{z^{s}}=a\cdot z^{s}+b \qquad (7)$$

where $a$ is sampled from $\mathcal{U}(0.9,1.1)$ and $b$ is set to 0.
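A corresponding sketch of the affine pattern under the same assumptions:

```python
import torch

def lsa_affine(z_s: torch.Tensor) -> torch.Tensor:
    """Eq. (7): scale by a ~ U(0.9, 1.1); the offset b is kept at 0 as in the paper."""
    a = torch.empty(1, device=z_s.device).uniform_(0.9, 1.1)
    return a * z_s + 0.0
```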

Batch Mixup (BM). Inspired by data mixup strategies [19], we create new latent features by blending pairs of spoof features within the batch, yielding smoother transitions and intermediate variations:

$$\hat{z_{i}^{s}}=\alpha\cdot z_{i}^{s}+(1-\alpha)\cdot z_{\pi(i)}^{s} \qquad (8)$$

where $i$ indexes the batch, $\pi$ denotes a random permutation of the batch indices, and $\alpha$ is a mixup coefficient sampled from $\mathrm{Beta}(0.5,0.5)$.
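A sketch of batch mixup on the spoof embeddings; the Beta(0.5, 0.5) coefficient and in-batch permutation follow Eq. (8), while the rest is illustrative:

```python
import torch

def lsa_batch_mixup(z_s: torch.Tensor) -> torch.Tensor:
    """Eq. (8): blend each spoof embedding with a randomly permuted partner."""
    alpha = torch.distributions.Beta(0.5, 0.5).sample().to(z_s.device)
    perm = torch.randperm(z_s.size(0), device=z_s.device)
    return alpha * z_s + (1.0 - alpha) * z_s[perm]
```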

The following two techniques rely on the prototypes learned in latent space refinement:

Linear Interpolation (LI). To create more challenging examples near the decision boundary, we linearly interpolate spoof embeddings towards the bonafide prototype $c^{b}$. Since the LSR prototypes are normalized to lie on a unit hypersphere due to the use of cosine similarity, the norm of the embedding is incorporated to adjust for the transition to Euclidean space:

$$\hat{z^{s}}=z^{s}+\lambda_{i}\cdot\left(\frac{\|z^{s}\|}{\|c^{b}\|}c^{b}-z^{s}\right) \qquad (9)$$

where $\lambda_{i}$ is an interpolation coefficient sampled from $\mathcal{U}(0,0.1)$, and the norm ratio $\|z^{s}\|/\|c^{b}\|$ aligns the scales of the two vectors.
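A possible implementation of Eq. (9), assuming `bona_proto` is the bonafide prototype vector; sampling a separate λ per sample is an assumption, as the paper does not state whether λ is shared across the batch:

```python
import torch

def lsa_interpolate(z_s: torch.Tensor, bona_proto: torch.Tensor) -> torch.Tensor:
    """Eq. (9): move spoof embeddings a small step towards the (norm-matched) bonafide prototype."""
    lam = torch.empty(z_s.size(0), 1, device=z_s.device).uniform_(0.0, 0.1)
    target = (z_s.norm(dim=-1, keepdim=True) / bona_proto.norm()) * bona_proto  # rescaled c^b
    return z_s + lam * (target - z_s)
```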

Linear Extrapolation (LE). In addition to interpolation, we also perform extrapolation from the nearest spoof prototype to create new features:

$$\hat{z^{s}}=z^{s}+\lambda_{e}\cdot\left(z^{s}-\frac{\|z^{s}\|}{\|c^{s}_{n}\|}c^{s}_{n}\right) \qquad (10)$$

where $c^{s}_{n}$ is the spoof prototype nearest to $z^{s}$ and $\lambda_{e}$ is an extrapolation coefficient sampled from $\mathcal{U}(0,0.1)$. As before, the norm ratio $\|z^{s}\|/\|c^{s}_{n}\|$ adjusts for the Euclidean representation. This method pushes the spoof features further away from the nearest prototype, generating more diverse variations.
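A sketch of Eq. (10); selecting the nearest prototype by cosine similarity matches the LSR geometry, but the concrete code is again illustrative:

```python
import torch
import torch.nn.functional as F

def lsa_extrapolate(z_s: torch.Tensor, spoof_protos: torch.Tensor) -> torch.Tensor:
    """Eq. (10): push spoof embeddings away from their nearest (norm-matched) spoof prototype."""
    sims = F.normalize(z_s, dim=-1) @ F.normalize(spoof_protos, dim=-1).t()  # (N, K)
    nearest = spoof_protos[sims.argmax(dim=-1)]                              # (N, D)
    lam = torch.empty(z_s.size(0), 1, device=z_s.device).uniform_(0.0, 0.1)
    scale = z_s.norm(dim=-1, keepdim=True) / nearest.norm(dim=-1, keepdim=True)
    return z_s + lam * (z_s - scale * nearest)
```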

Finally, the augmented latent features $\hat{z^{s}}$ are concatenated with the original features $z$, forming $z^{\prime}=[z\parallel\hat{z^{s}}]$. These enhanced features are then used for loss calculation during training, allowing the model to learn from a more varied set of spoofed data.
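Putting the pieces together, a hedged usage sketch of how augmented spoof embeddings might be concatenated with the original batch before computing the loss; the label convention (1 = spoof) and the random choice among patterns, mirroring the "All" setting, are assumptions:

```python
import random
import torch

def augment_batch(z: torch.Tensor, labels: torch.Tensor, augmenters: list):
    """Apply one LSA pattern to the spoof embeddings and append them to the batch.
    Assumes label 1 = spoof; prototype-based patterns can be pre-bound with functools.partial."""
    z_s = z[labels == 1]
    z_hat = random.choice(augmenters)(z_s)          # one randomly chosen pattern per batch
    z_prime = torch.cat([z, z_hat], dim=0)          # z' = [z || z_hat]
    labels_prime = torch.cat(
        [labels, torch.ones(z_hat.size(0), dtype=labels.dtype, device=labels.device)])
    return z_prime, labels_prime
```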

III Experiments

III-A Experimental Settings

Datasets and metrics. We train all systems using the ASVspoof 2019 LA training set [17], which includes approximately 25k utterances and 6 spoofing attacks involving VC or TTS. To evaluate generalization performance, we test on multiple datasets: the ASVspoof 2019 LA evaluation set (19LA) [17], containing 71k utterances with 13 different spoofing attacks; the ASVspoof 2021 LA set (21LA) [1], comprising about 181k utterances with algorithms similar to 19LA but also reflecting the encoding and transmission effects of telephony systems; the ASVspoof 2021 DF set (21DF) [1], with over 600k utterances and more than 100 spoofing attacks processed with various lossy codecs; and the In-The-Wild dataset (ITW) [2], which features approximately 32k utterances collected under real-world, non-controlled conditions, making it a more challenging dataset. Performance is measured using the Equal Error Rate (EER).

Training details. We adopt the model architecture from [6], utilizing Wav2Vec2.0 XLSR [3] as the frontend feature extractor and AASIST [20] as the backend classifier. Input speech is randomly chunked into 4-second segments, with Rawboost [10] applied as basic augmentation and codec augmentation as extra augmentation. The learning rate is set to 1e-6 for the backbone model and 1e-3 for the prototypes in LSR. For the LSR loss, we set the scaling factor s=32s=32, angular margin m=0.2m=0.2, and regularization coefficient δ=0.2\delta=0.2. For the WCE loss, the weights for bonafide and spoof classes are set to 0.9 and 0.1, respectively. For LSA, we either fix one type of augmentation during training or randomly select from all augmentation types (denoted as All).

III-B Overall Performance Comparison

TABLE I: Overall performance comparison in EER(%) across multiple datasets. All systems are trained on the ASVspoof2019 LA training set. Best results are highlighted in bold, and second-best results are underlined.
System 19LA 21LA 21DF ITW
WavLM+AttM [21] 0.65 3.50 3.19 -
Wav2Vec+LogReg [22] 0.50 - - 7.20
WavLM+MFA [23] 0.42 5.08 2.56 -
Wav2Vec+VIB [24] 0.40 4.92 - -
OCKD [25] 0.39 0.90 2.27 7.68
GFL-FAD [26] 0.25 - - -
Wav2Vec+Linear [13] 0.22 3.63 3.65 16.17
OC+ACS [8] 0.17 1.30 2.19 -
Wav2Vec+AASIST [6] - 0.82 2.85 -
Wav2Vec+AASIST2 [27] 0.15 1.61 2.77 -
Wav2vec+Conformer+TCM [28] - 1.03 2.06 -
Wav2vec+STJ-GAT+BLDL [29] 0.06 0.56 1.89 -
LSR 0.19 2.35 3.01 6.58
LSR+LSA 0.15 1.19 2.43 5.92
LSR+LSA* 0.12 1.05 1.86 5.54
  * with extra data augmentation.

To evaluate the overall performance of the proposed methods, we tested the system on four datasets and compared the results with those from the literature that used the same training dataset, as shown in Table I. Across all datasets, LSR+LSA consistently outperforms LSR alone and often ranks among the top performers, highlighting the effectiveness of integrating latent space refinement with latent space augmentation. To further enhance the results, we applied additional data augmentation, which led to EERs of 0.12% on 19LA, 1.05% on 21LA, 1.86% on 21DF, and 5.54% on ITW. This places our method on par with, or ahead of, the current state-of-the-art methods. Notably, our method focuses on refining and augmenting the latent space, which contrasts with recent approaches that focus on modifying the model architecture [28, 29]. These two strategies—latent space manipulation and architectural improvements—target different aspects of the problem and could potentially be combined for even better results. This highlights the flexibility and advantage of our method, as it enhances generalization without needing to alter the underlying model architecture. In summary, the proposed LSR+LSA method consistently delivers strong results, matching or outperforming state-of-the-art performance across various datasets, demonstrating its robustness and effectiveness in generalizing across diverse deepfake detection tasks.

Figure 2: t-SNE visualization of the training dataset featuring various latent space augmentations. The green, blue, and red points represent the 2D projections of embeddings for the bonafide, spoof, and augmented spoof classes, respectively.

III-C Ablation Study on Latent Space Refinement

TABLE II: EER (%) across datasets for systems trained with different loss configurations. Best results are in bold, and second-best results are underlined.
Loss Configuration 19LA 21LA 21DF ITW Avg.
WCE 0.30 2.64 4.74 8.09 3.94
OC Softmax 0.31 1.60 4.06 7.86 3.46
LSR 0.23 1.55 3.22 7.45 3.11
   w/o $\mathcal{L}_{inter}$ 0.23 1.84 3.30 7.84 3.30
   w/o $\mathcal{L}_{intra}$ 0.27 2.62 4.02 7.75 3.67
   w/o $\mathcal{L}_{intra}$, $\mathcal{L}_{inter}$ 0.32 2.86 4.11 8.05 3.84
WCE+LSR 0.19 2.35 3.01 6.58 3.03

Table II presents the performance of various loss configurations during training. The baseline configuration uses weighted cross entropy (WCE) loss for binary classification, with OC Softmax [7] included for comparison. Incorporating Latent Space Refinement (LSR) improves performance over both WCE and OC Softmax. We further examine the effects of LSR’s loss terms. Removing inter-class regularization results in minimal degradation, indicating that the core prototype-based loss sufficiently handles prototype separation. However, removing intra-class regularization significantly reduces performance, as this term is crucial for maintaining prototype diversity within the spoof class and preventing collapse. When both regularizations are removed, performance drops to baseline levels. Additionally, combining LSR with WCE yields the best overall results. While WCE provides a solid foundation for binary classification, LSR refines the latent space to better capture variations in spoofed data. This combination leads to improved generalization across the datasets.

Figure 3: The effect of the number of spoofed prototypes on EER (%) across different datasets (21LA, 21DF, and ITW).

Meanwhile, we evaluated the impact of the number of spoof prototypes on performance, as shown in Fig. 3. Increasing the number of prototypes from 1 to 8 improves performance, but a further increase to 16 shows diminishing returns. At 20 prototypes, performance declines, suggesting that too many prototypes can hinder generalization.

III-D Ablation Study on Latent Space Augmentation

TABLE III: EER(%) across datasets for systems trained with different latent space augmentation. Best results are in bold, and second-best results are underlined.
Method 19LA 21LA 21DF ITW Avg.
LSR 0.19 2.35 3.01 6.58 3.03
   +LSA(AN) 0.16 1.67 2.85 6.17 2.71
   +LSA(AT) 0.19 1.62 2.57 6.69 2.77
   +LSA(BM) 0.21 1.65 2.86 6.61 2.93
   +LSA(LI) 0.23 1.92 2.65 7.05 2.96
   +LSA(LE) 0.18 1.52 2.54 6.15 2.60
   +LSA(All) 0.15 1.19 2.43 5.92 2.42

To assess the impact of different latent space augmentation methods, we conducted experiments for each method, as summarized in Table III, and visualized their effects using t-SNE in Fig. 2. Notably, since LI and LE rely on LSR prototypes, all systems were trained with LSR+WCE loss. Among the first three augmentations that are independent of the prototypes, AN and AT produced more dispersed and varied distributions, leading to better performance. In contrast, BM’s distribution remained closer to the original due to its mixup nature, which limited its effectiveness. For the prototype-dependent augmentations, LI, while beneficial, underperformed compared to the others, likely due to the consistent generation of challenging examples. LE, however, achieved the best results, as it effectively expanded the distribution into new regions of the latent space, offering a more balanced diversity. Ultimately, combining all augmentation methods led to the most diverse latent space, resulting in the highest overall performance.

TABLE IV: Comparison of augmentation effects in input vs. latent space across datasets (EER %).
Method Space 19LA 21LA 21DF ITW Avg.
None - 0.30 2.64 4.74 8.09 3.94
AN input 0.25 2.22 3.17 6.35 3.00
AN latent 0.23 2.05 2.84 6.21 2.83
AT input 0.27 2.43 3.44 6.81 3.24
AT latent 0.25 2.03 2.91 6.72 2.98
BM input 0.19 2.24 3.01 6.33 2.94
BM latent 0.19 2.21 2.95 6.56 2.98

While we have demonstrated the effectiveness of augmentation in the latent space, a natural question is whether applying the same augmentations in the input space yields comparable or even better results. To explore this, we compared augmentations applied in the input space with those applied in the latent space, focusing on the three methods that do not depend on latent prototypes or embeddings: AN, AT, and BM. All experiments were conducted using WCE loss without LSR. As shown in Table IV, applying augmentation in either space improves over the baseline to some extent. For AN and AT, augmentations performed in the latent space consistently yield better results than those in the input space. This suggests that latent space augmentations may more effectively capture the underlying data distributions that the model needs to learn. Interestingly, BM yields better results when applied in the input space than in the latent space. This outcome may be attributed to the nature of mixup augmentation, which has been widely shown to be effective in various audio-related tasks when performed on the input data. Input-space BM likely benefits from preserving more of the original data characteristics while still introducing beneficial variability.

IV Conclusions

This paper presents a novel approach to enhance the generalization of audio deepfake detection systems by integrating Latent Space Refinement (LSR) and Latent Space Augmentation (LSA). LSR introduces multiple learnable prototypes to better capture the complex intra-class variability of spoofed audio, while LSA generates diverse representations in the latent space, further strengthening the model’s robustness. Extensive experiments on multiple datasets, including ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, and In-The-Wild, demonstrate that LSR and LSA each improve performance significantly, and that their integration achieves results that match or surpass the current state of the art.

Acknowledgment

This work was partially supported by the National Natural Science Foundation of China (NSFC) under Grants 62122050 and 62071288, and the Shanghai Municipal Science and Technology Commission under Grant 2021SHZDZX0102. Additional support was provided by the Pioneer R&D Program of Zhejiang Province (No. 2024C01024) and the Ant Group Research Intern Program.

References

  • [1] Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, et al., “Asvspoof 2021: accelerating progress in spoofed and deepfake speech detection,” in ASVspoof 2021 Workshop-Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021.
  • [2] Nicolas M Müller, Pavel Czempin, Franziska Dieckmann, Adam Froghyar, and Konstantin Böttinger, “Does audio deepfake detection generalize?,” Interspeech, 2022.
  • [3] Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli, “Xls-r: Self-supervised cross-lingual speech representation learning at scale,” arXiv, vol. abs/2111.09296, 2021.
  • [4] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518.
  • [5] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
  • [6] Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, and Nicholas Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in The Speaker and Language Recognition Workshop, 2022.
  • [7] You Zhang, Fei Jiang, and Zhiyao Duan, “One-class learning towards synthetic voice spoofing detection,” IEEE Signal Processing Letters, vol. 28, pp. 937–941, 2021.
  • [8] Hyun Myung Kim, Kangwook Jang, and Hoirin Kim, “One-class learning with adaptive centroid shift for audio deepfake detection,” in Interspeech 2024, 2024, pp. 4853–4857.
  • [9] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019.
  • [10] Hemlata Tak, Madhu Kamble, Jose Patino, Massimiliano Todisco, and Nicholas Evans, “Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6382–6386.
  • [11] Linjuan Zhang, Kong Aik Lee, Lin Zhang, Longbiao Wang, and Baoning Niu, “Cpaug: Refining copy-paste augmentation for speech anti-spoofing,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10996–11000.
  • [12] Marcella Astrid, Enjie Ghorbel, and Djamila Aouada, “Targeted augmented data for audio deepfake detection,” in 32nd European Signal Processing Conference (EUSIPCO 2024), 2024.
  • [13] Xin Wang and Junichi Yamagishi, “Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  • [14] Xin Wang and Junichi Yamagishi, “Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end?,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10311–10315.
  • [15] Xiaofeng Liu, Yang Zou, Lingsheng Kong, Zhihui Diao, Junliang Yan, Jun Wang, Site Li, Ping Jia, and Jane You, “Data augmentation via latent space interpolation for image classification,” in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 728–733.
  • [16] Zhiyuan Yan, Yuhao Luo, Siwei Lyu, Qingshan Liu, and Baoyuan Wu, “Transcending forgery specificity with latent space augmentation for generalizable deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8984–8994.
  • [17] Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Héctor Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, et al., “Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, pp. 101114, 2020.
  • [18] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4690–4699.
  • [19] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
  • [20] Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, and Nicholas Evans, “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2022, pp. 6367–6371.
  • [21] Zihan Pan, Tianchi Liu, Hardik B. Sailor, and Qiongqiong Wang, “Attentive merging of hidden embeddings from pre-trained speech model for anti-spoofing detection,” in Interspeech 2024, 2024, pp. 2090–2094.
  • [22] Octavian Pascu, Adriana Stan, Dan Oneata, Elisabeta Oneata, and Horia Cucu, “Towards generalisable and calibrated audio deepfake detection with self-supervised representations,” in Interspeech 2024, 2024, pp. 4828–4832.
  • [23] Yinlin Guo, Haofan Huang, Xi Chen, He Zhao, and Yuehai Wang, “Audio deepfake detection with self-supervised wavlm and multi-fusion attentive classifier,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12702–12706.
  • [24] Youngsik Eom, Yeonghyeon Lee, Ji Sub Um, and Hoi Rin Kim, “Anti-spoofing using transfer learning with variational information bottleneck,” in Interspeech 2022, 2022, pp. 3568–3572.
  • [25] Jingze Lu, Yuxiang Zhang, Wenchao Wang, Zengqiang Shang, and Pengyuan Zhang, “One-class knowledge distillation for spoofing speech detection,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11251–11255.
  • [26] Xiaopeng Wang, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Yuankun Xie, Yukun Liu, Jianhua Tao, Xuefei Liu, Yongwei Li, Xin Qi, Yi Lu, and Shuchen Shi, “Genuine-focused learning using mask autoencoder for generalized fake audio detection,” in Interspeech 2024, 2024, pp. 4848–4852.
  • [27] Yuxiang Zhang, Jingze Lu, Zengqiang Shang, Wenchao Wang, and Pengyuan Zhang, “Improving short utterance anti-spoofing with aasist2,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11636–11640.
  • [28] Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, and Eng Siong Chng, “Temporal-channel modeling in multi-head self-attention for synthetic speech detection,” in Interspeech 2024, 2024, pp. 537–541.
  • [29] Haochen Wu, Wu Guo, Zhentao Zhang, Wenting Zhao, Shengyu Peng, and Jie Zhang, “Spoofing speech detection by modeling local spectro-temporal and long-term dependency,” in Interspeech 2024, 2024, pp. 507–511.