
Extended U-Net for Speaker Verification in Noisy Environments

Abstract

Background noise is a well-known factor that deteriorates the accuracy and reliability of speaker verification (SV) systems by degrading speech intelligibility. Various studies have used separate pretrained enhancement models as the front-end module of the SV system in noisy environments, and these methods effectively remove noise. However, the denoising process of an independent enhancement model not tailored to the SV task can also distort the speaker information contained in utterances. We argue that the enhancement network and speaker embedding extractor should be fully jointly trained for SV tasks under noisy conditions to alleviate this issue. Therefore, we propose a U-Net-based integrated framework that simultaneously optimizes speaker identification and feature enhancement losses. Moreover, we analyze the structural limitations of using the U-Net directly for noisy SV tasks and further propose an extended U-Net to reduce these drawbacks. We evaluated the models on the noise-synthesized VoxCeleb1 test set and the VOiCES development set recorded in various noisy scenarios. The experimental results demonstrate that the U-Net-based fully joint training framework is more effective than the baseline and that the extended U-Net achieves state-of-the-art performance compared to recently proposed compensation systems.

Index Terms: speaker verification, noisy environment, feature enhancement, fully joint training, U-Net.

1 Introduction

Recently, deep learning has been applied in various fields with tremendous success, and in speaker verification (SV) tasks, deep neural network (DNN)-based embedding learning methods [1, 2] exhibit satisfactory results compared to traditional schemes [3, 4]. Most embedding extractors perform reliably in clean scenarios but suffer from performance degradation in noisy environments: it is challenging to extract speaker information delicately and accurately from noise-contaminated utterances because background noise undermines the intelligibility and quality of speech [5, 6].

Several studies have used a pretrained enhancement model as a front-end module for an embedding extractor (Figure 1 (a)) to address this problem [7, 8, 9]. A denoising autoencoder, a traditional enhancement method, is trained to map noisy speech to its clean counterpart using a loss function such as the mean squared error (MSE) [7, 10]. The denoising autoencoder can improve the auditory quality of noisy speech, leading to better results in noisy SV scenarios. Nonetheless, the performance improvement for clean input is marginal, and an independent compensation process not customized to the downstream task may even distort speaker information [11, 9].

VoiceID loss [11] was proposed to construct an enhancement model specialized for SV tasks. As presented in Figure 1 (b), the mask output from the enhancement model is multiplied by the input feature element-wise to filter out redundant information polluted by noise. Then, the refined feature is fed to the pretrained SV model, and only the weights of the enhancement model are updated to reduce the speaker identification (SID) loss. This framework can generate an enhancement model suitable for SV but cannot jointly optimize both models. Therefore, the fixed SV model cannot continuously learn the modified distributions of enhanced features, which may lead to information mismatch [12].

To prevent the distortion of speaker information and mitigate the information discrepancy caused by independent enhancement, we argue that the embedding extractor and enhancement model should be trained in a fully joint manner rather than separately. The U-Net [13] is an image-to-image translation network based on an encoder-decoder structure, originally proposed for biomedical image segmentation, which delivers a processed output with the same (or a similar) size as the input data. Owing to this structural property, the U-Net has been widely used in speech processing tasks such as voice activity detection [14] and speech enhancement [15]. Inspired by the architectural characteristics of the U-Net and its successful adaptation to the speech domain, we propose to exploit the U-Net framework to optimize the SID and feature enhancement (FE) losses simultaneously. The proposed U-Net-based system is depicted in Figure 1 (c). This model is jointly trained to classify the speaker identity of the embeddings transformed from the intermediate features while reducing the Euclidean distance between the enhanced features from the decoder and their clean versions. Thus, the U-Net-based system can directly derive noise-robust speaker embeddings through this training process.

Figure 1: Block diagrams of noisy speech compensation approaches for speaker verification tasks. (a): Pretrained enhancement model, (b): VoiceID loss [11] ($\otimes$ denotes element-wise multiplication), (c): U-Net-based system (Proposal 1), (d): Extended U-Net (Proposal 2)

Although we devised a joint-trainable framework, the proposed U-Net-based system has several structural limitations. First, this system does not use the enhanced features from the decoder to extract speaker embeddings. Furthermore, noise compensation is performed only with the encoder in the evaluation phase, without using the decoder. Therefore, we hypothesize that the vanilla U-Net is not an ideal fully joint training structure for noisy SV tasks and further propose an extended U-Net (ExU-Net) to address these shortcomings. The ExU-Net (Figure 1 (d)) is jointly trained by combining an additional embedding extractor with the U-Net framework. This extended structure can extract embeddings from the enhanced features and fully exploit the jointly trained model during evaluation. Moreover, we applied a metric learning loss for direct embedding enhancement (EE).

The models were trained using the VoxCeleb1 [16] training data and the MUSAN corpus [17]. For evaluation in noisy environments, we used the VoxCeleb1 test set synthesized with MUSAN and the VOiCES [18] development dataset containing various noise sources. The experimental results demonstrated that the U-Net-based system achieved improved performance compared to the baseline and that the ExU-Net exhibited state-of-the-art results in noisy scenarios.

2 Related work

SV studies for noisy environments have been conducted from various perspectives. Several researchers have focused on algorithmic methods that compensate for noisy utterances directly at the signal level [19, 20]. At the feature level, data augmentation [6] and feature normalization [21] techniques have been used to construct noise-robust SV systems. Noisy-condition SV has also been studied at the model level: using a separate enhancement model as a front-end or back-end system for the embedding extractor can prevent performance deterioration due to noise [11, 22]. Furthermore, Cai et al. [6] proposed specialized loss functions that induce noise-mitigated speaker embeddings, and Wu et al. [12] introduced asynchronous subregion optimization to avoid collisions between losses. The present study also explores improving noisy SV performance at the model level and focuses on a fully joint training scheme to alleviate the distortion of speaker information that may occur during noise compensation.

Figure 2: The training pipelines for the baseline, U-Net, and ExU-Net are indicated by orange, green, and blue arrows, respectively. The baseline is based on the encoder of the U-Net. The ExU-Net uses the enhanced spectrograms to extract speaker embeddings through an additional extractor. The numbers in parentheses refer to the number of batches, channels, frequency bins, and frames of the output feature map.

3 Proposed frameworks

To mitigate the adverse effects caused by independent noise compensation, we speculate that the SV system should be trained by simultaneously optimizing the SID and FE losses. This section explicates the proposed fully joint training framework. As illustrated in Figure 2, we describe the models according to the scope of the framework in the following order: baseline, U-Net, and ExU-Net.

3.1 Baseline

The encoder of our proposed frameworks captures latent features containing speaker information from the input data. The residual network (ResNet) [23] is one of the most successful DNNs of the past decade and has demonstrated outstanding performance as a speaker embedding extractor in SV tasks [24]. Therefore, we used a ResNet-based system [25] as the encoder and designated it as the baseline for direct comparison with the proposed models. The structure of the baseline is described in the left two columns of Table 1.

The dotted thin orange arrows in Figure 2 denote the training path of the baseline. We used a mini-batch ($\mathcal{M}$) consisting of clean and noisy spectrograms to directly learn the distributions of both types of data:

\mathcal{M} = [\,S_{1}^{1}, S_{2}^{1}, \ldots, S_{n}^{1}, \tilde{S}_{1}^{2}, \tilde{S}_{2}^{2}, \ldots, \tilde{S}_{n}^{2}\,],   (1)

where $S_{i}^{j}$ indicates the $j$th clean spectrogram of the $i$th speaker, $\tilde{S}$ refers to a noise-synthesized spectrogram, and $n$ is the number of speakers in a single mini-batch. The input is processed into speaker embeddings $A, \tilde{A} \in \mathbb{R}^{256}$, which are trained to classify the speaker identity via a categorical cross-entropy (CCE) criterion:

A = g(f_{enc}(S)),   (2)
\tilde{A} = g(f_{enc}(\tilde{S})),

where $f_{enc}$ indicates the encoder, and $g$ represents the final pooling and fully connected (FC) layers. Therefore, the loss function of the baseline is:

\mathcal{L}_{baseline} = \mathcal{L}_{CCE}.   (3)
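
To make the baseline update concrete, the following PyTorch sketch implements Eqs. (2) and (3), assuming that encoder (the ResNet encoder $f_{enc}$), pooling_fc (ASP pooling followed by an FC layer, i.e., $g$), and a speaker classifier are available as modules; these names are ours and not taken from the released code.

import torch.nn.functional as F

def baseline_step(encoder, pooling_fc, classifier, batch, labels):
    # batch: the mini-batch of Eq. (1), shape (2n, 1, n_mels, n_frames); labels: speaker identities, shape (2n,)
    latent = encoder(batch)                  # f_enc(S) and f_enc(S~)
    embeddings = pooling_fc(latent)          # g(.): ASP pooling + FC layer, shape (2n, 256)
    logits = classifier(embeddings)          # logits over the training speakers
    loss = F.cross_entropy(logits, labels)   # L_baseline = L_CCE, Eq. (3)
    return loss, embeddings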

3.2 U-Net-based system

This study emphasized the significance of joint learning between the enhancement model and embedding extractor under noisy SV scenarios. Motivated by the speech enhancement research that successfully leveraged the structural characteristics of U-Net, we devised a fully joint training framework for noisy SV tasks based on the U-Net.

The dotted thick green arrows in Figure 2 illustrate the training process of the proposed U-Net-based system, and the structural details are specified in Table 1. The speaker embeddings are extracted through the same process as the baseline and are trained on the SID task, as in Eq. (2). The decoder network, $f_{dec}$, derives enhanced spectrograms ($O$ and $\tilde{O}$) of the same size as the input. The decoder uses the output of each encoder block to decode the enhanced spectrograms with minimal information loss (dotted gray arrows in Figure 2).

O = f_{dec}(f_{enc}(S)),   (4)
\tilde{O} = f_{dec}(f_{enc}(\tilde{S})).
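
To illustrate how the decoder consumes the encoder's skip outputs, the sketch below implements one U-Net-style decoder block in PyTorch: the feature from the matching encoder block is concatenated channel-wise with the current decoder feature, fused, and upsampled with a transposed convolution, in the spirit of the DB blocks in Table 1. The channel arguments are placeholders rather than the exact values used in the paper.

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, dec_ch, skip_ch, out_ch):
        super().__init__()
        # fuse the concatenated decoder and skip features
        self.fuse = nn.Conv2d(dec_ch + skip_ch, out_ch, kernel_size=1)
        # transposed convolution to restore the resolution reduced by the encoder
        self.up = nn.ConvTranspose2d(out_ch, out_ch, kernel_size=2, stride=2)

    def forward(self, dec_feat, skip_feat):
        x = torch.cat([dec_feat, skip_feat], dim=1)  # skip connection (gray arrows in Figure 2)
        return self.up(self.fuse(x))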

The enhanced spectrograms are optimized to reduce the $L_2$ distance from the clean spectrograms of the input (i.e., the ground truth) using the MSE loss:

\mathcal{L}_{MSE} = \frac{1}{2n}\sum^{n}_{i=1}\left(||O_{i}^{1}-S_{i}^{1}||^{2}_{2}+||\tilde{O}_{i}^{2}-S_{i}^{2}||^{2}_{2}\right).   (5)

The output for clean input is also mapped to the ground truth itself to prevent performance degradation when a clean utterance is fed into the compensation system, as reported in [22].

In summary, the U-Net-based system is jointly trained to discriminate speakers and compensate for noisy utterances by simultaneously optimizing the following two losses:

\mathcal{L}_{Unet} = \mathcal{L}_{CCE} + \mathcal{L}_{MSE}.   (6)

Through this learning strategy, the proposed model can extract noise-robust speaker embeddings.
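
A compact sketch of this joint objective is given below, continuing the illustrative PyTorch conventions used above (here the encoder is assumed to also return the skip features consumed by the decoder). noisy_ref holds the clean versions of the noise-synthesized inputs, so both branches are regressed onto clean targets as in Eq. (5); the module names remain assumptions for illustration.

import torch
import torch.nn.functional as F

def unet_losses(encoder, decoder, pooling_fc, classifier,
                clean, noisy, noisy_ref, labels):
    batch = torch.cat([clean, noisy])        # [S_1^1, ..., S_n^1, S~_1^2, ..., S~_n^2]
    latent, skips = encoder(batch)           # f_enc(.) and the skip features for the decoder
    embeddings = pooling_fc(latent)          # g(f_enc(.)), Eq. (2)
    sid_loss = F.cross_entropy(classifier(embeddings), labels)

    enhanced = decoder(latent, skips)        # O and O~, same size as the input, Eq. (4)
    targets = torch.cat([clean, noisy_ref])  # clean inputs are mapped back to themselves
    fe_loss = F.mse_loss(enhanced, targets)  # mean-reduced stand-in for Eq. (5)
    return sid_loss + fe_loss                # L_Unet, Eq. (6)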

Table 1: Architectures of the proposed framework components. Each decoder block (DB) includes a corresponding encoder block (EB) in which the number of input/output channels is reversed. For the convolutional layers (Conv), the numbers inside parentheses indicate the kernel length, padding, and stride sizes, in that order. (SE: squeeze-and-excitation module [26], ASP: attentive statistics pooling [27], TConv: transposed convolution layer)

Layer | Structure | Layer | Structure
Conv | Conv2d(7, 3, 2×1) | DB1 | EB4, Concatenation, Conv2d(1, 1, 1)
EB1 | [Conv2d(3, 1, 1), Conv2d(3, 1, 1), SE] × 3 | DB2 | EB3, Concatenation, TConv2d(2, 1, 2)
EB2 | [Conv2d(3, 1, 2), Conv2d(3, 1, 1), SE] × 4 | DB3 | EB2, Concatenation, TConv2d(2, 1, 2)
EB3 | [Conv2d(3, 1, 2), Conv2d(3, 1, 1), SE] × 6 | DB4 | EB1, Concatenation, Conv2d(1, 1, 1)
EB4 | [Conv2d(3, 1, 1), Conv2d(3, 1, 1), SE] × 3 | TConv | Concatenation, TConv2d(2×1, 1, 2×1)
Pooling, FC | ASP, FC(256) | - | -
Table 2: Experiment results (EER %, $C_{det}^{min}$) obtained on the VoxCeleb1 test set and the noise scenarios synthesized with the MUSAN corpus under various SNRs (some of the comparison results are drawn from [12]). The first Baseline column is trained on the original data $D$ only; all remaining systems are trained with noise augmentation ($D+D^{N}$).

Noise type | SNR | Baseline ($D$) | Baseline | VoiceID [11] | Wu et al. [12] | U-Net | Cai et al. [6] | ExU-Net-L | ExU-Net
# Parameters | - | 1.39M | 1.39M | - | - | 3.41M | - | 1.38M | 4.81M
Original test set | $\tau$ | 3.75 | 3.69 | 6.79 | 7.6 | 3.57 | 3.12 | 3.23 | 2.76
Babble | 0 | 27.73 | 12.93 | 37.96 | 20.11 | 11.31 | 11.78 | 10.9 | 9.57
Babble | 5 | 14.54 | 7.23 | 27.12 | 12.02 | 6.62 | 5.97 | 6.04 | 5.52
Babble | 10 | 8.12 | 5.44 | 16.66 | 9.63 | 4.96 | 4.44 | 4.4 | 4.06
Babble | 15 | 5.45 | 4.54 | 11.25 | 8.48 | 4.19 | 3.73 | 3.66 | 3.28
Babble | 20 | 4.37 | 4.06 | 8.99 | 7.99 | 3.85 | 3.36 | 3.43 | 2.99
Music | 0 | 23.4 | 9.62 | 16.24 | 12.92 | 8.94 | 7.79 | 8.03 | 7.35
Music | 5 | 13.5 | 6.6 | 11.44 | 10.1 | 5.83 | 5.23 | 5.49 | 4.9
Music | 10 | 7.96 | 5.05 | 9.13 | 8.95 | 4.76 | 4.11 | 4.19 | 3.69
Music | 15 | 5.54 | 4.47 | 8.10 | 8.35 | 4.17 | 3.63 | 3.78 | 3.14
Music | 20 | 4.55 | 4.06 | 7.48 | 7.95 | 3.72 | 3.30 | 3.47 | 2.93
Noise | 0 | 21.41 | 9.26 | 16.56 | 13.12 | 8.09 | 7.34 | 7.68 | 6.8
Noise | 5 | 13.43 | 6.81 | 12.26 | 10.57 | 6.37 | 5.65 | 5.85 | 5.23
Noise | 10 | 8.64 | 5.42 | 9.86 | 9.28 | 5 | 4.35 | 4.6 | 4.07
Noise | 15 | 6.3 | 4.51 | 8.69 | 8.59 | 4.29 | 3.85 | 3.9 | 3.39
Noise | 20 | 4.94 | 4.22 | 7.83 | 8.1 | 3.99 | 3.44 | 3.74 | 3.1
Average (EER / $C_{det}^{min}$) | - | 10.85 / 0.758 | 6.12 / 0.31 | 13.52 / - | 10.24 / - | 5.6 / 0.315 | 5.07 / 0.563 | 5.15 / 0.286 | 4.55 / 0.254

3.3 ExU-Net

Although the described U-Net-based system can induce discriminative and noise-mitigated embeddings, the enhanced spectrograms extracted from the decoder are not directly exploited for speaker embedding extraction. In other words, the indirect utilization of the decoder during noise compensation training means that not all parameters of the trained U-Net can be used for embedding extraction. Therefore, we propose an extended U-Net (ExU-Net) with an additional extractor to alleviate these structural weaknesses.

The solid blue arrows in Figure 2 depict the feedforward flow of the ExU-Net. The extractor is identical to the encoder structure, except for the concatenation layer added at the beginning of each block. The speaker embeddings ($B$, $\tilde{B} \in \mathbb{R}^{256}$) of the proposed ExU-Net are derived from the extractor network, $f_{ext}$, by directly feeding it the enhanced spectrograms from the decoder as follows:

B = g(f_{ext}(O)) = g(f_{ext}(f_{dec}(f_{enc}(S)))),   (7)
\tilde{B} = g(f_{ext}(\tilde{O})) = g(f_{ext}(f_{dec}(f_{enc}(\tilde{S})))).

Thus, the ExU-Net's SID loss ($\mathcal{L}^{\prime}_{CCE}$) optimizes the speaker embeddings extracted from the enhanced spectrograms, in contrast to Eqs. (2) and (3).
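
Reusing the hypothetical modules from the earlier sketches, the ExU-Net forward path of Eq. (7) can be written as follows, where extractor stands for the additional embedding network $f_{ext}$.

import torch

def exunet_embeddings(encoder, decoder, extractor, pooling_fc, clean, noisy):
    batch = torch.cat([clean, noisy])
    latent, skips = encoder(batch)                # f_enc(.)
    enhanced = decoder(latent, skips)             # O and O~ from the decoder
    embeddings = pooling_fc(extractor(enhanced))  # B, B~ = g(f_ext(f_dec(f_enc(.)))), Eq. (7)
    return embeddings, enhanced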

Additionally, considering that the SV task compares the similarity between embeddings, we applied metric learning to directly enhance the embeddings. For the EE loss, we used the angular prototypical network (APN) loss [25]:

T_{i,j} = w\cdot\cos(B_{i}^{1},\tilde{B}_{j}^{2}) + b,   (8)

where $T_{i,j}$ is the cosine similarity ($\cos$) between the embeddings $B_{i}^{1}$ and $\tilde{B}_{j}^{2}$, with the learnable weight $w$ and bias $b$, respectively. The EE loss ($\mathcal{L}_{APN}$) encourages the model to explore noise-robust and discriminative embedding spaces by decreasing the intra-class variance between clean-noisy embedding pairs of the same speaker and increasing the inter-class variance between different speakers:

\mathcal{L}_{APN} = -\frac{1}{n}\sum_{j=1}^{n}\log\frac{\exp(T_{j,j})}{\sum_{i=1}^{n}\exp(T_{i,j})}.   (9)

Finally, the ExU-Net is trained jointly to optimize the following three losses:

\mathcal{L}_{ExUnet} = \mathcal{L}^{\prime}_{CCE} + \mathcal{L}_{MSE} + \mathcal{L}_{APN}.   (10)
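
The EE term of Eqs. (8) and (9) can be sketched in PyTorch as below: the scaled cosine-similarity matrix between clean and noisy embeddings is treated as a softmax classification problem whose target for each noisy embedding is the clean embedding of the same speaker. The initial values of the learnable scale and bias are common-practice assumptions, not values reported in the paper.

import torch
import torch.nn.functional as F

class AngularPrototypicalLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.tensor(10.0))  # learnable scale w (initial value assumed)
        self.b = torch.nn.Parameter(torch.tensor(-5.0))  # learnable bias b (initial value assumed)

    def forward(self, clean_emb, noisy_emb):
        # T[i, j] = w * cos(B_i^1, B~_j^2) + b, Eq. (8)
        sim = F.normalize(clean_emb, dim=-1) @ F.normalize(noisy_emb, dim=-1).t()
        logits = self.w * sim + self.b
        labels = torch.arange(clean_emb.size(0), device=clean_emb.device)
        # softmax over i for each column j with target i = j reproduces Eq. (9)
        return F.cross_entropy(logits.t(), labels)

The total objective of Eq. (10) is then the sum of this term, the SID cross-entropy computed on the extractor embeddings, and the MSE loss of Eq. (5).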

4 Experimental Setting

4.1 Datasets

Experiments were performed on the VoxCeleb1 [16] and VOiCES [18] datasets. VoxCeleb1 consists of a training set with 148,642 utterances from 1,211 speakers and a test set with 4,874 utterances from 40 speakers. Although the VoxCeleb dataset, collected from YouTube videos, is moderately noisy, we considered the original utterances clean and additionally generated noisy data. The MUSAN corpus [17] was used as the noise source, and we divided the MUSAN dataset into two nonoverlapping parts for data augmentation in the training and test phases. For training, we used the original VoxCeleb1 training dataset $D$ alone or together with the noise set ($D^{N}$) synthesized using the MUSAN training subset at a signal-to-noise ratio (SNR) randomly selected between 0 and 20 dB.
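
As an illustration of this augmentation step, the sketch below mixes a MUSAN-style noise segment into a clean waveform at a requested SNR, assuming both signals are mono PyTorch tensors at the same sampling rate; the function name and details are ours.

import torch

def mix_at_snr(speech, noise, snr_db):
    if noise.numel() < speech.numel():        # loop the noise if it is shorter than the speech
        noise = noise.repeat(speech.numel() // noise.numel() + 1)
    noise = noise[:speech.numel()]
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    # scale the noise so that 10*log10(speech_power / scaled_noise_power) = snr_db
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

During training, snr_db would be drawn uniformly from [0, 20]; for the synthesized test sets it is fixed to one of {0, 5, 10, 15, 20}.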

For the evaluation under noise scenarios, we used two test sets. First, for the VoxCeleb1 test set, we synthesized test data for each noise type at SNRs of {0, 5, 10, 15, 20} dB using the MUSAN test subset, in addition to the original test set $\tau$. Second, we exploited the VOiCES development dataset of 15,904 audio segments from 196 speakers for noise scenarios in a different domain. The VOiCES dataset was collected using array microphones at diverse distances and acoustic conditions in rooms of various sizes. We compared the models based on the equal error rate (EER) and the minimum detection cost function ($C_{det}^{min}$) using the cosine similarity score, as in [28].
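
For reference, trial scoring and the EER can be computed as in the minimal sketch below; $C_{det}^{min}$ is obtained analogously by sweeping the same threshold with the detection cost parameters of [28]. The helper names are ours.

import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(emb1, emb2):
    # trial score: cosine similarity between two speaker embeddings
    return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2) + 1e-10))

def equal_error_rate(scores, labels):
    # labels: 1 for target trials, 0 for non-target trials
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # operating point where FNR and FPR cross
    return (fnr[idx] + fpr[idx]) / 2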

4.2 Implementation Details

We employed 64-dimensional mel-spectrograms as input, extracted with a 1024-point FFT and a 25 ms Hamming window with a 10 ms hop. The mini-batch consisted of one clean and one noisy utterance per 60 randomly selected speakers, totaling 120 utterances. In all experiments, we used the Adam optimizer with a learning rate of 0.001. Each model was trained for 500 epochs, and the learning rate was decreased by 5% every 10 epochs. Batch normalization [29] and rectified linear unit activation [30] were applied after the convolutional layers. This study did not use any voice activity detection technique. For reproducibility, we provide the experimental code and the weights of the trained models at https://github.com/wngh1187/ExU-Net.
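
A feature-extraction sketch matching these settings is given below, using torchaudio; the 16 kHz sampling rate (standard for VoxCeleb) and the log compression are assumptions on our part.

import torch
import torchaudio

# 64-dimensional mel-spectrogram: 1024-point FFT, 25 ms Hamming window, 10 ms hop (at an assumed 16 kHz)
mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,
    win_length=400,    # 25 ms at 16 kHz
    hop_length=160,    # 10 ms at 16 kHz
    window_fn=torch.hamming_window,
    n_mels=64,
)

def extract_features(waveform):
    # waveform: tensor of shape (1, num_samples) -> log mel-spectrogram of shape (1, 64, num_frames)
    return (mel_extractor(waveform) + 1e-6).log()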

Table 3: Ablation experiments on the ExU-Net loss functions (VoxCeleb1, EER % / $C_{det}^{min}$).

Systems | SID | FE | EE | Original | Noise mean
#1 | × | MSE | APN | 3.62 / 0.25 | 5.66 / 0.322
#2 | CCE | × | APN | 2.82 / 0.18 | 4.83 / 0.282
#3 | CCE | MSE | × | 3.32 / 0.227 | 5.38 / 0.318
#4 | CCE | MSE | MSE | 3.09 / 0.201 | 4.75 / 0.272
#5 | CCE | MSE | APN | 2.76 / 0.178 | 4.67 / 0.26

5 Results

In Table 2, we compare the models with recently proposed SV systems for noisy environments on the VoxCeleb1 test set. For convenience of description, we calculated the average EER and $C_{det}^{min}$ over all evaluation conditions. The baseline system trained only on the original clean data $D$ has difficulty coping with noisy utterances, resulting in significant degradation under noisy scenarios. Comparing the baselines under the two training dataset conditions, simply using the augmented training data $D^{N}$ enhanced the robustness to noisy utterances (6.12% vs. 10.85%). In addition, the proposed U-Net-based system (U-Net) showed a relative error reduction (RER) of 8.5% in average EER compared to the baseline trained using $D+D^{N}$ (5.6% vs. 6.12%). The ExU-Net proposed in this study outperformed the other models on the original test trials and achieved state-of-the-art performance in all noise scenarios, with an RER of 25.7% over the baseline (4.55% vs. 6.12%). Furthermore, to demonstrate the superiority of the ExU-Net structure, we additionally constructed a lightweight variant of the ExU-Net (ExU-Net-L) by adjusting the number of parameters to be similar to that of the baseline. The ExU-Net-L demonstrated better results than the baseline and even the U-Net and exhibited competitive performance compared to the recently proposed models. Based on these results, we conclude that the fully joint training scheme is effective for noisy SV tasks and that the proposed ExU-Net is robust to noisy environments as well as clean scenarios.

Table 3 presents the results of the ablation experiments conducted to analyze the effectiveness and appropriateness of each ExU-Net loss function. The models were evaluated based on the EER and $C_{det}^{min}$ of the original test set and the mean over all noise-type evaluations. Systems 1 and 3 exhibited notable performance degradation compared to the ExU-Net (System 5), whereas System 2 achieved comparable results even though FE learning was excluded. These results indicate that the SID loss is crucial to the generalization of the proposed model and that noise compensation at the embedding level is more effective than at the feature level. Compared to System 5, System 4 deteriorated more in the original evaluation condition than on the noise test sets. This implies that the MSE used for EE learning can compensate for noisy utterances by reducing the Euclidean distance between pairs of clean and noisy embeddings, but unlike the APN loss, it cannot consider the inter-class variance between speakers.

Table 4 reports the results of each model on the VOiCES development set. All models were trained using the VoxCeleb1 training set ($D+D^{N}$). Similar to the previous results, the U-Net displayed improved performance compared with the baseline. In addition, the ExU-Net achieved the best performance among all models, with an RER of 15.3% over the baseline. These results on an evaluation dataset from a domain different from the training set demonstrate the strong generalization ability of the proposed models.

Table 4: Results (EER %, $C_{det}^{min}$) on the VOiCES development set.

Metric | Baseline | VoiceID | Wu et al. | U-Net | ExU-Net
EER (%) | 7.71 | 13.57 | 12.51 | 6.97 | 6.53
$C_{det}^{min}$ | 0.44 | - | - | 0.42 | 0.38

6 Conclusion

Focusing on the drawbacks of independent enhancement modules for noisy utterances in SV tasks, we emphasized the necessity of a fully joint training scheme between the noise compensation and SV models. This paper proposed models that simultaneously optimize feature (and embedding) enhancement and SID losses using the U-Net or its extension (ExU-Net). Owing to the joint learning of two or three tasks, the proposed models can extract noise-robust speaker embeddings customized for SV tasks. Evaluation on the noise-synthesized VoxCeleb1 test set demonstrated that the U-Net-based system improved over the baseline and that the ExU-Net achieved state-of-the-art performance. In addition, we verified the effectiveness and validity of the ExU-Net loss functions through ablation experiments. Finally, the proposed models demonstrated outstanding generalization performance on the VOiCES dataset. As future work, we will evaluate the proposed models in various real-world noise environments.

7 Acknowledgement

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2020R1A2C1007081).

References

  • [1] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2014, pp. 4052–4056.
  • [2] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5329–5333.
  • [3] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted gaussian mixture models,” Digital signal processing, vol. 10, no. 1-3, pp. 19–41, 2000.
  • [4] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
  • [5] M. Wölfel and J. McDonough, Distant speech recognition.   John Wiley & Sons, 2009.
  • [6] D. Cai, W. Cai, and M. Li, “Within-sample variability-invariant loss for robust speaker recognition under noisy environments,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 6469–6473.
  • [7] O. Plchot, L. Burget, H. Aronowitz, and P. Matejka, “Audio enhancing with dnn autoencoder for speaker recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2016, pp. 5090–5094.
  • [8] S. E. Eskimez, P. Soufleris, Z. Duan, and W. Heinzelman, “Front-end speech enhancement for commercial speaker verification systems,” Speech Communication, vol. 99, pp. 101–113, 2018.
  • [9] M. Kolboek, Z.-H. Tan, and J. Jensen, “Speech enhancement using long short-term memory based recurrent neural networks for noise robust speaker verification,” in 2016 IEEE spoken language technology workshop (SLT).   IEEE, 2016, pp. 305–311.
  • [10] O. Novotny, O. Plchot, P. Matejka, and O. Glembek, “On the use of dnn autoencoder for robust speaker recognition,” arXiv preprint arXiv:1811.02938, 2018.
  • [11] S. Shon, H. Tang, and J. Glass, “Voiceid loss: Speech enhancement for speaker verification,” Interspeech, pp. 2888–2892, 2019.
  • [12] Y. Wu, L. Wang, K. A. Lee, M. Liu, and J. Dang, “Joint feature enhancement and speaker recognition with multi-objective task-oriented network,” Interspeech, pp. 1089–1093, 2021.
  • [13] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [14] A. Gusev, V. Volokhov, T. Andzhukaev, S. Novoselov, G. Lavrentyeva, M. Volkova, A. Gazizullina, A. Shulipa, A. Gorlanov, A. Avdeeva et al., “Deep speaker embeddings for far-field speaker recognition on short utterances,” in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 179–186.
  • [15] R. Giri, U. Isik, and A. Krishnaswamy, “Attention wave-u-net for speech enhancement,” in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).   IEEE, 2019, pp. 249–253.
  • [16] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: A large-scale speaker identification dataset,” Interspeech, 2017.
  • [17] D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
  • [18] C. Richey, M. A. Barrios, Z. Armstrong, C. Bartels, H. Franco, M. Graciarena, A. Lawson, M. K. Nandwana, A. Stauffer, J. van Hout, P. Gamble, J. Hetherly, C. Stephenson, and K. Ni, “Voices Obscured in Complex Environmental Settings (VOiCES) Corpus,” in Interspeech, 2018, pp. 1566–1570.
  • [19] B. J. Borgström and A. McCree, “The linear prediction inverse modulation transfer function (lp-imtf) filter for spectral enhancement, with applications to speaker recognition,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2012, pp. 4065–4068.
  • [20] L. Mošner, P. Matějka, O. Novotnỳ, and J. H. Černockỳ, “Dereverberation and beamforming in far-field speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5254–5258.
  • [21] J. Pelecanos and S. Sridharan, “Feature warping for robust speaker verification,” in Proceedings of 2001 A Speaker Odyssey: The Speaker Recognition Workshop.   European Speech Communication Association, 2001, pp. 213–218.
  • [22] J.-w. Jung, J.-h. Kim, H.-j. Shim, S.-b. Kim, and H.-J. Yu, “Selective deep speaker embedding enhancement for speaker verification,” in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 171–178.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [24] J. Zhou, T. Jiang, Z. Li, L. Li, and Q. Hong, “Deep speaker embedding extraction with channel-wise feature responses and additive supervision softmax loss function.” in Interspeech, 2019, pp. 2883–2887.
  • [25] J. S. Chung, J. Huh, S. Mun, M. Lee, H.-S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, “In defence of metric learning for speaker recognition,” Proc. Interspeech 2020, pp. 2977–2981, 2020.
  • [26] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [27] K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Proc. Interspeech, 2018.
  • [28] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” in Proc. Interspeech 2018, 2018, pp. 1086–1090. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1929
  • [29] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning.   PMLR, 2015, pp. 448–456.
  • [30] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in ICML, 2010.