
On the Impact of Voice Anonymization on Speech Diagnostic Applications: a Case Study on COVID-19 Detection

Yi Zhu, Mohamed Imoussaïne-Aïkous, Carolyn Côté-Lussier, and Tiago H. Falk The authors are with the Institut national de la recherche scientifique, University of Québec, Montréal, Canada. Our code and voice demos are made available at https://github.com/zhu00121/Anonymized-speech-diagnostics.
Abstract

With advances seen in deep learning, voice-based applications are burgeoning, ranging from personal assistants and affective computing to remote disease diagnostics. As the voice contains both linguistic and para-linguistic information (e.g., vocal pitch, intonation, speech rate, loudness), there is growing interest in voice anonymization to preserve speaker privacy and identity. Voice privacy challenges have emerged over the last few years and focus has been placed on removing speaker identity while keeping linguistic content intact. For affective computing and disease monitoring applications, however, the para-linguistic content may be more critical. Unfortunately, the effects that anonymization may have on these systems are still largely unknown. In this paper, we fill this gap and focus on one particular health monitoring application: speech-based COVID-19 diagnosis. We test three anonymization methods and their impact on five different state-of-the-art COVID-19 diagnostic systems using three public datasets. We validate the effectiveness of the anonymization methods, compare their computational complexity, and quantify the impact across different testing scenarios for both within- and across-dataset conditions. Additionally, we provide a comprehensive evaluation of the importance of different speech aspects for diagnostics and show how they are affected by different types of anonymizers. Lastly, we show the benefits of using anonymized external data as a data augmentation tool to help recover some of the COVID-19 diagnostic accuracy loss seen with anonymization.

Index Terms:
Voice anonymization, health diagnostics, COVID-19 detection

I Introduction

Speech is one of the most powerful and easy-to-use communication interfaces between humans and machines. For example, voice assistants relying on automatic speech recognition (ASR) allow humans to control devices by providing voice commands [1]; automatic speaker verification (ASV) systems enable users to access personal properties (e.g., online bank accounts) via their voice [2]. More recently, speech has also been shown as a promising measure for in-home disease detection and monitoring, including COVID-19 [3], chronic obstructive pulmonary disease (COPD) [4], and Alzheimer’s disease [5], just to name a few.

Speech-based diagnostic systems have been motivated by the fact that speech requires complex and precise coordination of the respiratory system and neuromuscular control [6]. Diseases that cause dysfunction in speech production would then lead to changes in vocal characteristics. For example, major symptoms of COVID-19, such as cough, muscle soreness, and decreased neuromuscular control [7, 8, 9], have been shown to relate to increased vocal hoarseness and variance in syllabic rate [10]. While human ears may not be able to capture such subtle changes, machine learning (ML) models have demonstrated the capability to detect certain abnormal patterns present in pathological speech [10, 11, 12].

Today, the great majority of speech-based applications rely on deep neural network (DNN) architectures with models containing hundreds of millions of parameters, with this number continuously rising. Commonly, these parameters are not stored locally on mobile devices [13] and speech data are sent to and processed in the cloud; decisions are then transmitted back to the user device. As more and more cases of cyberattacks are being reported [14, 15, 16], this transmission of speech data over the cloud could pose serious threats to user privacy. It has been previously reported that voice assistants and many third-party applications collect users’ voices without their knowledge and share it with advertising partners [17]. For example, Amazon patented a technique which recognizes health status via conversations with users and advertises the related medicines to them [18]. This could be particularly risky for speech diagnostics applications, since the user’s voice could be linked with sensitive medical information, such as health status[19], disease progression[20], or mental state [21], just to name a few. As such, speech privacy-preserving methods have gained increased attention globally, especially with the release of regulations, such as the General Data Protection Regulation (GDPR) in Europe [22] and the Personal Information Protection Law (PIPL) in China [23]; the latter is particularly aimed at personal biometrics (i.e., voice, facial image, and fingerprints).

Alternately, voice anonymization methods have emerged with the aim of manipulating the speech signal such that information about speaker identity is obfuscated, while the linguistic content and other para-linguistic attributes (e.g., timbre, naturalness) remain intact. Given the burgeoning interest in this domain, the Voice Privacy Challenges (VPC) were held in 2020 and 2022 to foster development in speech anonymization techniques [24, 25]. However, these challenges were aimed at developing anonymization methods for downstream automatic speech recognition tasks [24, 25, 26, 27], where linguistic content was preserved, but not para-linguistic information.

As speech applications emerge beyond the realm of ASR, it is important to also gauge what impacts anonymization tools can have on other downstream tasks. Some initial attempts have been made in this realm. Nourtel et al. showed significant degradation in speech emotion recognition when anonymization was applied [28]. Dumpala et al. performed an initial exploration of the privacy-preserving features of depression speech [29]. To the best of our knowledge, gauging the impact of anonymization on speech diagnostic applications has yet to be explored; this paper aims to fill that gap.

Furthermore, in a real-world scenario, diagnostics models are usually trained on open-source datasets due to the scarcity of medical data [30], while test data may come from varying conditions (e.g., geographic locations, languages, collection devices, etc.). Hence, it is difficult in practice to have training and test data anonymized using the exact same approach. However, existing anonymization testing conditions commonly assume that downstream models either have no or full knowledge of how the training and test data are anonymized (i.e., ignorant or fully-informed, respectively). As more anonymization techniques emerge, alternate testing conditions could be implemented, such as training with data processed by other anonymization methods (i.e., in a semi-informed manner) or with a mix of original data and data processed by conventional anonymization tools (i.e., augmented). Hence, more complex testing conditions need to be considered. Lastly, to avoid private information being sent to the cloud, voice anonymization should be deployed locally on the user device, which could have limited computational resources. As such, it is important to evaluate the computational complexity (i.e., time and capacity needed for computation) of the anonymization methods alongside their effectiveness.

In this study, we comprehensively evaluated the impact of three voice anonymization methods on the accuracy of five leading COVID-19 detection systems. We started by quantifying the efficacy and computational complexity of the anonymization methods with COVID-19 speech recordings. We then investigated the within and cross-dataset performance of five COVID-19 diagnostics systems in different conditions, and explored the reasons behind the impact of different anonymization methods on diagnostics. Lastly, we showed the benefits of using anonymized external data as a data augmentation tool to recover the diagnostics accuracy loss in anonymized data. The following paper is organized as follows. Section II summarizes the related works in speech-based COVID-19 diagnostics and speech anonymization. Section III and IV describe the main components of the anonymized speech diagnostics framework and the experimental set-up. Section V describes and discusses the obtained results. Section VI presents the conclusions.

II Related Work

II-A Speech-based COVID-19 Diagnostics

Speech-based diagnostic systems can be categorized into two groups: ones that rely on carefully designed hand-crafted features coupled with conventional machine learning classifiers, and ones that input raw signals directly into a deep learning model for classification. In the latter ‘end-to-end’ scenario, the deep learning model serves as a feature extractor and feature mapping function in one.

When it comes to feature extraction from speech, the openSMILE toolkit [31] is by far the most popular. The largest feature set of openSMILE extracts over 6,000 acoustic features, including mel-frequency cepstral coefficients (MFCC), pitch contours, voicing-related information, as well as several other low-level descriptors (LLDs). This feature set has been used together with conventional classifiers, such as support vector machines (SVM), for the detection of different diseases [32, 33, 34, 35]. More recently, it has been employed as a benchmark feature set for the INTERSPEECH 2021 ComParE COVID-19 Detection Challenge [36]. For in-the-wild speech analysis, on the other hand, the modulation spectral representation (MSR) has shown benefits over openSMILE features for different applications (e.g., [37, 38]), including disease characterization (e.g., [39, 40]) and COVID-19 detection [10].

Existing end-to-end systems, in turn, have relied on variants of the spectrogram representation as input, including the mel-spectrogram or the log-mel-spectrogram, as well as convolutional or recurrent neural network architectures for classification. Han et al., for example, showed that VGGish neural networks outperformed conventional methods in classifying different COVID-19 symptoms [35]. Akman et al. developed a ResNet-like architecture for speech and cough-based COVID-19 detection [41]. The Bi-directional Long-Short-Term-Memory (BiLSTM) neural network was used in the top-performing system competing in the second Diagnosis of COVID-19 using Acoustics (DiCOVA2) Challenge [12]. Compared to conventional systems, end-to-end systems have demonstrated overall higher performance on several datasets without the need for a separate feature extraction step [42, 43, 12]. Nonetheless, recent research has shown that while end-to-end models achieve state-of-the-art accuracy on a particular dataset, those results do not transfer well to other unseen datasets, where accuracy can drop to below chance levels [44]; this was not the case with hand-crafted features and conventional classifiers.

II-B Speech Anonymization

Anonymization techniques comprise two categories: speech transformation and speech conversion. The former refers to modifications directly to the original speech, such as pitch shifting and warping [45, 46], to remove personal identifiable information from the speech signal. The latter, in turn, converts one’s voice to sound like that of another without changes in linguistic content [47]. As voice privacy concerns are on the rise, voice anonymization has gained popularity recently and, in 2020, the Voice Privacy Challenge (VPC) was created [24]. A popular method from the 2020 and 2022 VPCs employs the so-called McAdams coefficients [24, 25], where shifts in the pole positions derived from linear predictive coding (LPC) analysis of speech signals [48] are used to achieve anonymization. Another popular voice transformation method is termed voicemask [49], where certain frequency components are compressed (or stretched) to generate a lower-pitched (or higher-pitched) voice signal. Voice conversion systems, on the other hand, have usually relied on modifications to speaker embeddings, such as the x-vector [50] and the ECAPA-TDNN embeddings [51], which are assumed to only carry nonverbal information that pertains to the speaker identity alone. The modified speaker embeddings are then input with speech content sequence to a speech synthesis module to reconstruct a new speech waveform [26]. Several innovations have been proposed to the speech synthesis module to make the outcome sound more natural and of greater quality and intelligibility [52, 53, 54, 27].

III Anonymized Speech Diagnostics Systems

III-A System Overview

Figure 1 depicts the diagram of an anonymized speech diagnostics (SD) system. Conventionally, the original voice of user X is input to a diagnostic system that will generate a positive or negative output for the tested disease and/or symptom. If an automatic speaker verification (ASV) system was trained with data from user X, the ASV system would be able to detect user X’s voice. In practice, SD systems are complex and models are often stored on the cloud, thus requiring the user’s voice (or features) to be uploaded to the cloud. This transmission of data could result in privacy concerns. To overcome this, voice anonymization can be employed locally and anonymized data (or features) are sent to the cloud. In this case, user X would not be identified by the ASV system and speech-based diagnostics could proceed in a more secure and private manner.

Refer to caption
Figure 1: Block diagram of a speech-based diagnostics system with (protected) and without (unprotected) anonymization. ‘SD’ stands for speech-based diagnostic system and ‘ASV’ for automatic speaker verification.

III-B Speech-Based Diagnostic Systems

Based on previous experiments on COVID-19 detection (e.g., [10]), the five top-performing diagnostics systems are explored herein:

III-B1 openSMILE+SVM

A total of 6,373 static acoustic features were first extracted using the openSMILE toolbox [31] and then input to an SVM classifier with a linear kernel. This system was used as the benchmark in the 2021 ComParE COVID-19 Speech Sub-challenge [36].

III-B2 openSMILE+PCA+SVM

The high dimensionality of the openSMILE features can be problematic for smaller datasets. In [55], principal component analysis (PCA) [56] was used to compress the 6,000+ features into 300 components. Here, the number of principal components was treated as a hyper-parameter and a value of 100 was found to strike a good balance between accuracy and dimensionality.
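To make the pipeline concrete, the sketch below shows a minimal scikit-learn implementation of this openSMILE+PCA+SVM system; the variable names (X_train, y_train, X_test) and the use of scikit-learn are our own assumptions, as the exact implementation details are given in the code repository.

```python
# Minimal sketch of the openSMILE+PCA+SVM pipeline (Section III-B2).
# Assumes X_train, y_train, X_test hold 6,373-dimensional openSMILE
# functionals already extracted per recording; names are illustrative.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

clf = make_pipeline(
    StandardScaler(),            # zero-mean, unit-variance scaling fitted on training data
    PCA(n_components=100),       # compress 6,373 features to 100 principal components
    SVC(kernel="linear", probability=True),  # linear-kernel SVM classifier
)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]     # COVID-positive probability used for AUC-ROC
```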

III-B3 MSR+SVM

The MSR features have been used in [10, 44] and shown to outperform openSMILE-based systems and to provide improved generalizability across datasets. The interested reader is referred to [39, 57] for more details about the modulation spectrum. The modulation spectrum decomposes each frequency component along time into different modulation frequencies, which captures the abnormalities in respiration and articulation by focusing on long-term dynamics of speech. Each modulation spectrum comprises 23 frequency bins and 8 modulation frequency bins, which is then flattened into a vector and used as input to a linear SVM classifier.
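As an illustration, the following sketch computes a simplified 23×8 modulation spectral representation from a waveform; the mel filterbank front-end and the uniform grouping of modulation frequencies are simplifying assumptions and do not reproduce the exact processing of [39, 57].

```python
# Simplified modulation spectral representation (MSR): 23 acoustic bands,
# each decomposed along time into 8 coarse modulation-frequency bins.
import numpy as np
import librosa

def modulation_spectrum(wav, sr=16000, n_bands=23, n_mod_bins=8):
    # 23-band spectral energy envelopes over time (mel filterbank assumed here)
    S = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_bands, hop_length=160)
    env = np.log(S + 1e-8)                                   # (23, T) temporal envelopes
    # modulation spectrum: FFT of each band's (mean-removed) envelope across time
    mod = np.abs(np.fft.rfft(env - env.mean(axis=1, keepdims=True), axis=1))
    # group modulation frequencies into 8 coarse bins (simple averaging here)
    chunks = np.array_split(mod, n_mod_bins, axis=1)
    msr = np.stack([c.mean(axis=1) for c in chunks], axis=1)  # (23, 8)
    return msr.flatten()                                      # 184-dimensional feature vector
```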

III-B4 MSR+PCA+SVM

For more direct comparisons with the openSMILE system, here we also explore the compression of the 184-dimensional (23×8) vector via PCA, resulting in a final 100-dimensional vector for classification.

III-B5 Logmelspec+BiLSTM

The winning system in the DiCOVA2 Challenge was employed [12] as a benchmark. This system adopts the conventional log-mel-spectrogram (logmelspec) with first-and second-order deltas as input, along with a BiLSTM as the classifier. More details about the network architecture can be found in [12].
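For illustration, a minimal sketch of the log-mel-spectrogram input with first- and second-order deltas is given below; the frame and mel-band settings are assumed and are not necessarily those used in [12].

```python
# Log-mel-spectrogram with first- and second-order deltas as BiLSTM input.
import numpy as np
import librosa

def logmelspec_with_deltas(wav, sr=16000, n_mels=64):
    S = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    logS = librosa.power_to_db(S)                   # log-mel-spectrogram (n_mels, T)
    d1 = librosa.feature.delta(logS, order=1)       # first-order deltas
    d2 = librosa.feature.delta(logS, order=2)       # second-order deltas
    return np.concatenate([logS, d1, d2], axis=0)   # (3 * n_mels, T) input matrix
```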

III-C Speech Anonymization Methods

A voice transformation and two voice conversion methods are explored here to gauge their differences in speech diagnostics performance. More details are provided below.

III-C1 McAdams coefficient

This approach uses a classical signal processing technique and does not require model training. It employs the so-called McAdams coefficient method [58, 48] to shift the position of formants measured using linear predictive coding (LPC) [59]. For each short-time speech frame, the method first separates the linear prediction residuals and linear prediction (LP) coefficients. The LP coefficients are then converted to pole positions in the z-plane by polynomial root-finding, where each pole position represents the position of one formant. The phase of the poles with imaginary parts is then raised to the power of the McAdams coefficient α\alpha. The new set of poles is then converted back to LP coefficients. Together with the original residuals, a new speech frame can be synthesized.
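The sketch below illustrates the per-frame McAdams procedure described above (LPC analysis, raising the phase of the complex poles to the power of the coefficient, and re-synthesis from the residual); the LPC order, the frame handling (no pre-emphasis or overlap-add shown), and the use of librosa/scipy are our assumptions.

```python
# Per-frame McAdams-coefficient anonymization sketch.
import numpy as np
import librosa
from scipy.signal import lfilter

def mcadams_frame(frame, alpha=0.8, lpc_order=20):
    a = librosa.lpc(frame, order=lpc_order)          # LP coefficients [1, a1, ..., ap]
    residual = lfilter(a, [1.0], frame)              # LP residual (excitation signal)
    poles = np.roots(a)                              # poles of the synthesis filter 1/A(z)
    new_poles = poles.copy()
    is_complex = np.abs(np.imag(poles)) > 1e-8
    # shift formant positions: raise the pole phase to the power of the McAdams coefficient
    phase = np.angle(poles[is_complex])
    new_poles[is_complex] = np.abs(poles[is_complex]) * np.exp(
        1j * np.sign(phase) * np.abs(phase) ** alpha)
    a_new = np.real(np.poly(new_poles))              # back to LP coefficients
    return lfilter([1.0], a_new, residual)           # re-synthesize the anonymized frame
```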

III-C2 Ling-GAN

For voice conversion, we implemented two systems based on generative adversarial networks (GAN). The overall architecture of these systems can be found in Figure 2. The first system, abbreviated as ‘Ling-GAN’, was an off-the-shelf anonymizer from [27], where all modules were already trained and applied to COVID-19 data without any fine-tuning. In general, it preserves the linguistic content (i.e., phoneme sequence) and uses a generator to produce fake, yet realistic speaker embeddings to substitute the original speaker embeddings. The original speech is first input to an automatic speech recognition (ASR) model to extract the phone sequence. The ASR model used here is based on the hybrid CTC (Connectionist Temporal Classification)/attention architecture [60] with a Conformer encoder [61] and a Transformer decoder. It should be emphasized that the output of the ASR is a phoneme sequence, detailing not only the phonemes uttered but also the pauses. In our exploratory analysis, we found that the removal of these pauses would change the rhythm of the generated speech and lead to degraded diagnostic performance. We hence kept all pauses in the extracted phoneme sequences. The ASR model used here supports English as the default input language, hence it may produce erroneous transcriptions when other languages are used. Although this issue could potentially be tackled by substituting a multilingual ASR model, the compatibility of such models with the anonymization and synthesizer blocks has not been tested. Hence, we retain the architecture proposed in [27] and leave language compatibility for future investigation.

The anonymization is divided into two stages. During the first stage, the 512-dimensional x-vector [50] and the 192-dimensional ECAPA-TDNN vector [51] are extracted using the SpeechBrain toolkit [62] and concatenated as the final speaker embeddings. At the second stage, a Wasserstein GAN with Quadratic Transport Cost (WGAN-QC) [63] is used to generate a pool of 5,000 ‘converted’ speaker embeddings, which is saved for later use. When a new recording is input to the system, the model iteratively searches the pool and stops when it finds an embedding with a cosine distance above 0.3 from the original speaker embeddings. This new embedding is then used to substitute the original one for synthesis. The 0.3 cosine distance threshold was suggested in [27]; it ensures sufficient difference in speaker traits while maintaining naturalness. Finally, the FastSpeech 2 model [64] is used to synthesize the phone sequence into a spectrogram, followed by a HiFiGAN vocoder [52] to convert the spectrogram into a final speech waveform. The synthesizer is conditioned on the anonymized speaker embedding, hence keeping the linguistic content while obfuscating the speaker identity.
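A minimal sketch of this pool-based embedding substitution is shown below; the function and variable names are illustrative, and the embedding dimensionality (512 + 192 = 704) follows the description above.

```python
# Select a replacement speaker embedding from the pre-generated GAN pool:
# take the first candidate whose cosine distance to the original embedding
# exceeds the 0.3 threshold suggested in [27].
import numpy as np

def select_anonymized_embedding(original, pool, threshold=0.3):
    for candidate in pool:                          # pool: (5000, 704) generated embeddings
        cos_sim = np.dot(original, candidate) / (
            np.linalg.norm(original) * np.linalg.norm(candidate) + 1e-12)
        if 1.0 - cos_sim > threshold:               # cosine distance = 1 - cosine similarity
            return candidate
    return pool[-1]                                 # fallback if no candidate passes
```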

It is important to emphasize that this off-the-shelf GAN has not seen pathological speech data during its training [65]. As a consequence, the generated speaker embeddings may not encapsulate health-related attributes, thus affecting diagnostic accuracy. The last anonymization system used overcomes this limitation, as detailed next.

Refer to caption
Figure 2: Diagram of the two GAN-based anonymizers implemented in this study. Compared to the Ling-GAN, the Ling-Pros-GAN not only preserves the original prosody, but also has the generator and discriminator fine-tuned with COVID-19 speech data, enabling it to generate more COVID-like speaker embeddings.

III-C3 Ling-Pros-GAN

The second GAN-based system, abbreviated as ‘Ling-Pros-GAN’, was modified from [65] and can be seen as a more advanced version of Ling-GAN. While sharing a similar architecture (e.g., the ASR module and the synthesizer), Ling-Pros-GAN further preserves prosody (i.e., pitch, energy, and duration) during anonymization and uses the style embeddings from [66] to represent speaker attributes. In addition, we fine-tuned the generator and discriminator using the aggregated training set data from all three COVID-19 datasets employed in this study. The goal of fine-tuning was to enable the GAN to generate COVID-like speaker embeddings.

The generator and discriminator were jointly trained for 2,000 iterations, with a batch size of 128 and a learning rate of 5e-5. Other fine-tuning hyperparameters remained the same as reported in [65] and can also be found in our code repository. Figure 3 depicts the t-distributed stochastic neighbor embedding (t-SNE) plots [67] showing a 2-dimensional representation of the speaker embeddings in the COVID-19 datasets (red dots), those produced by the generator without fine-tuning (blue), and after fine-tuning (green). As can be seen, using just the pre-trained generator is not sufficient to model the COVID-19 speaker embedding distribution. With 2,000 iterations of fine-tuning, the generator was able to generate embeddings following a distribution similar to that of the COVID-19 embeddings.

Refer to caption
Figure 3: Distribution of speaker embeddings (‘emd’) generated by Ling-Pros-GAN with and without fine-tuning (‘ft’). Embeddings are projected to the 2-dimensional space using t-SNE.
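A sketch of how such a t-SNE projection can be produced is shown below; the embedding matrices (emb_covid, emb_gen_raw, emb_gen_ft) and the t-SNE settings are assumptions made for illustration.

```python
# t-SNE projection of real vs. generated speaker embeddings (cf. Figure 3).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# emb_covid, emb_gen_raw, emb_gen_ft: (N, D) arrays of speaker embeddings
emb_all = np.vstack([emb_covid, emb_gen_raw, emb_gen_ft])
groups = ([("COVID-19 emd", "red")] * len(emb_covid)
          + [("generated emd (no ft)", "blue")] * len(emb_gen_raw)
          + [("generated emd (ft)", "green")] * len(emb_gen_ft))

proj = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(emb_all)
for name, color in [("COVID-19 emd", "red"),
                    ("generated emd (no ft)", "blue"),
                    ("generated emd (ft)", "green")]:
    idx = [i for i, (g, _) in enumerate(groups) if g == name]
    plt.scatter(proj[idx, 0], proj[idx, 1], s=5, c=color, label=name)
plt.legend()
plt.show()
```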

Different from the original implementation in [65], where a pre-generated pool of speaker embeddings was used, we modified Ling-Pros-GAN so that it randomly generates a small set of new speaker embeddings each time it receives a recording, then chooses which embedding to swap in by iteratively examining the cosine similarity. In other words, Ling-Pros-GAN is guaranteed to generate an unseen version of anonymized speech even for the exact same input recording. In contrast, since Ling-GAN always chooses embeddings from a pre-generated pool, there is a slight chance that two recordings may be anonymized with the same generated embeddings; this probability increases as the number of speakers grows. While this modification to Ling-Pros-GAN improves privacy, it also increases computation time due to the online generation of speaker embeddings.

IV Experimental Setup

IV-A Databases

At the time of writing, most existing COVID-19 sound datasets target cough sounds, such as COUGHVID [68], Tos COVID-19 [69], Virufy [70], and NoCoCoDa [71]. Speech, on the other hand, is included in fewer datasets. To maximize the variability of data distribution and avoid biased results from one single dataset, we included three publicly available COVID-19 speech datasets, namely the multilingual 2021 ComParE COVID-19 Speech Sub-challenge (CSS) dataset [11], the second DiCOVA Challenge dataset [12], and the English subset from the Cambridge COVID-19 sound database [55]. These datasets are referred to hereinafter as CSS, DiCOVA2, and Cambridge, respectively. The demographics of the three datasets are summarized in Table I. It should be noted that though the full Cambridge database contains more speech samples, the English subset has been more carefully examined by the data holders to avoid potential confounding factors (e.g., languages, data quality, class balance) [55] and is therefore considered more suitable for our analysis.

All three datasets were crowdsourced: volunteers across the globe were encouraged to upload their voice data and metadata via apps. The same speech content was required per dataset. With CSS, participants were asked to utter the sentence “I hope my data can help to manage the virus pandemic” at most three times in their mother tongue, with the majority of samples uttered in English, Portuguese, Italian, and Spanish. The same speech content was used for the Cambridge set, but in English only. With DiCOVA2, participants counted from 1 to 10 at a normal pace in English. For all datasets, participants were asked to self-declare whether they were COVID-negative (including healthy or having COVID-like symptoms) or COVID-positive (including symptomatic and asymptomatic cases). As can be seen from Table I, all three sets contained 10% to 30% asymptomatic COVID-positive cases. Additionally, nearly half of the COVID-negative samples in CSS and Cambridge are symptomatic, which is three times higher than in DiCOVA2.

The CSS and Cambridge datasets were partitioned into three separate subsets by the challenge organizers, namely training, validation, and test. For comparisons, we employed the same challenge partition in this study. It should be emphasized that in the CSS dataset, several COVID-positive recordings were originally sampled at 8 kHz while the majority of the other files were sampled at 16 kHz. As suggested in [36] and our previous exploration [10], keeping these up-sampled recordings has been shown to lead to overly-optimistic results since classifiers learned to capture the difference in sampling rates instead of the actual pathological pattern. Thus, we removed them from our analysis. The DiCOVA2 dataset, in turn, is comprised of development and evaluation subsets, with the evaluation data being accessible only to challenge participants. Hence, we performed a speaker-independent training-test split (80/20%) using the development subset only and left the evaluation set for testing.

TABLE I: Dataset description and partitions. P-s: Symptomatic COVID-positive. P-a: Asymptomatic COVID-positive. N-s: Symptomatic COVID-negative. N-a: Asymptomatic COVID-negative. N/A: Information not provided.
Dataset | Duration | Language | Symptomatic ratio (P-s / P-a / N-s / N-a) | Gender (Male / Female) | Age (≤30 / 30-60 / ≥60) | Partition | COVID label (Pos / Neg) | Total
CSS | 3.24 hrs | Multi | 72% / 28% / 41% / 59% | 56% / 43% | 11% / 70% / 19% | train | 56 / 243 | 299
CSS | | | | | | valid | 130 / 153 | 283
CSS | | | | | | test | 87 / 189 | 266
DiCOVA2 | 3.93 hrs | EN | 87% / 13% / 14% / 86% | 74% / 26% | 43% / 52% / 5% | develop | 137 / 635 | 772
DiCOVA2 | | | | | | evaluate | 35 / 158 | 193
Cambridge | 5.29 hrs | EN | 87% / 13% / 47% / 54% | 50% / 50% | 24% / 56% / 20% | train | 490 / 530 | 1020
Cambridge | | | | | | valid | 82 / 60 | 142
Cambridge | | | | | | test | 162 / 162 | 324

IV-B Tasks

As our final goal is to not only provide accurate diagnostics decisions but also ensure the protection of privacy of speaker identity, the evaluation was divided into three tasks. In the first task, we compared the effectiveness and complexity of different anonymization techniques. In Task-2, we then quantified the impact of anonymization techniques on diagnostics accuracy in different conditions. Finally, we provided explanations for the impact seen in Task-2, and explored solutions for improving the proposed systems.

IV-B1 Task-1: Evaluating anonymization performance

As is shown in Figure 4, for each speech recording, the speaker embeddings were extracted separately from the original version, the McAdams-anonymized version, the Ling-GAN anonymized version, and the Ling-Pros-GAN anonymized version. Cosine similarity was then computed between the embeddings of each pair of signals, where higher cosine similarity values represent higher resemblance between two speech samples. Meanwhile, we employed the pre-trained ECAPA-TDNN speaker verification model from SpeechBrain [62] to detect whether two recordings are from the same speaker, and then evaluated the misclassification rate, where higher values suggest more successful anonymization. Since multiple evaluation scenarios were considered in this study, where training and test data were processed with different anonymization methods, the cosine similarity and the misclassification rate were computed not only between the clean and anonymized data, but also between data processed by different anonymization methods. Additionally, we measured the computation time spent by the three methods per recording, and calculated the average and standard deviation for each dataset. This helps to quantify and compare the time efficiency of the three anonymization methods.

Refer to caption
Figure 4: Evaluation of the effectiveness of different voice anonymization methods, as well as their computational complexity.
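A minimal sketch of this Task-1 evaluation is given below, using the pre-trained ECAPA-TDNN verifier from SpeechBrain; the model identifier, the pairing of files, and the interpretation of the verifier output as a same/different-speaker decision are assumptions based on the description above.

```python
# Cosine similarity and misclassification rate with a pre-trained
# ECAPA-TDNN speaker verifier (SpeechBrain).
from speechbrain.pretrained import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb", savedir="tmp_spkrec")

def evaluate_pairs(pairs):
    """pairs: list of (clean_path, anonymized_path) from the same speaker."""
    n_misclassified, sims = 0, []
    for clean_path, anon_path in pairs:
        score, prediction = verifier.verify_files(clean_path, anon_path)
        sims.append(float(score))                   # cosine similarity of the embeddings
        if not bool(prediction):                    # verifier says "different speaker"
            n_misclassified += 1                    # i.e., anonymization succeeded
    return sum(sims) / len(sims), n_misclassified / len(pairs)
```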

IV-B2 Task-2: Evaluating diagnostics accuracy

As mentioned in Section I, training and test data could be anonymized using different methods. To mimic a realistic setting, we explore four different scenarios, as detailed below. Table II summarizes these conditions.
Scenario-A: Unprotected: Here, both training and test data are original, thus anonymization is not performed. This encompasses the traditional diagnostic system evaluation and serves as a baseline of the maximum diagnostics accuracy that can be achieved by each model.
Scenario-B: (Anonymization) Ignorant: In this scenario, the training data are original, and only the test data are anonymized. This scenario can be further separated into three cases: test data are anonymized using the McAdams coefficient (scenario B1), the Ling-GAN (scenario B2), and the Ling-Pros-GAN (scenario B3). This scenario exemplifies the case where new anonymization methods are proposed and tested against legacy original diagnostic systems.
Scenario-C: Semi-informed: In this scenario, anonymized data are seen during training, but from a method different from that used for testing. Six combinations were possible out of the three systems, namely: training set comprised of McAdams coefficient anonymization and test set with Ling-GAN (C1), training set with McAdams anonymizer and test set with Ling-Pros-GAN (C2), training set with Ling-GAN and test set with McAdams anonymizer (C3), training set with Ling-GAN and test with Ling-Pros-GAN (C4), training set with Ling-Pros-GAN and test with McAdams anonymizer (C5), and training set with Ling-Pros-GAN and test with Ling-GAN (C6). This scenario exemplifies the case where new anonymization methods are proposed and tested against legacy or different anonymized systems.
Scenario-D: Fully-informed: In this setting, training and test data are both anonymized using the same method and parameters, with three cases: both are anonymized with the McAdams coefficient (D1) method, both with Ling-GAN (D2), and both with Ling-Pros-GAN (D3).

TABLE II: Training/test set details for the different conditions and scenarios explored.
Scenarios Sub-condition Training anonym. Test anonym.
Unprotected A Clean Clean
Ignorant B1 Clean McAdams Coefs
B2 Clean Ling-GAN
B3 Clean Ling-Pros-GAN
Semi-informed C1 McAdams Coef Ling-GAN
C2 McAdams Coef Ling-Pros-GAN
C3 Ling-GAN McAdams Coefs
C4 Ling-GAN Ling-Pros-GAN
C5 Ling-Pros-GAN McAdams Coefs
C6 Ling-Pros-GAN Ling-GAN
Fully-informed D1 McAdams Coefs McAdams Coefs
D2 Ling-GAN Ling-GAN
D3 Ling-Pros-GAN Ling-Pros-GAN

As data distributions vary across datasets [72], diagnostics performance obtained under within-dataset conditions may lack external validity and has been shown to be over-optimistic [72, 41, 44]. To ensure the generalizability of the tested methods, for each scenario we explore both within- and cross-dataset results. In the latter, models are trained on one dataset and tested on data from another set. As the CSS is a subset of the Cambridge set, we avoid using both datasets in the cross-database condition to avoid overly-optimistic results [55, 36].

IV-B3 Task-3: Anonymization for data augmentation

Data augmentation has been widely used in speech applications based on deep neural networks to improve accuracy, especially under mismatched train-test data distributions. One of the approaches to increase model generalizability is external data augmentation, which refers to the case where data from external datasets curated for similar tasks are pooled with the in-domain data to increase training sample size [73, 74]. In our case, we aim to improve the generalizability of diagnostic models to samples anonymized by unknown algorithms. We propose to combine anonymized external data with the original data as an augmentation approach to mitigate the degradation caused by anonymization. We focus on two cases, namely augmenting the ignorant and semi-informed scenarios, in which the diagnostic models performed worst. As shown in Figure 5, we experimented with four versions of the augmentation data, including the clean version (i.e., not anonymized), the McAdams-anonymized version, and the two GAN-anonymized versions. For simplicity, we sampled training and test data only from DiCOVA2 and used one of the other two datasets as external data. Since the performance of SVM and PCA-SVM was highly correlated, here we report only the improvement achieved with the openSMILE+SVM, MSR+SVM, and LogMelSpec+BiLSTM diagnostic systems.

Refer to caption
Figure 5: Data augmentation schemes in (top) scenario B (ignorant) and (bottom) scenario C (semi-informed).
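In feature space, this augmentation amounts to pooling the in-domain training set with one (possibly anonymized) version of the external dataset before fitting the diagnostic model, as sketched below; variable names are illustrative.

```python
# Pool in-domain training features with anonymized external data (Task-3 sketch).
import numpy as np

def augment(X_train, y_train, X_ext, y_ext):
    X_aug = np.vstack([X_train, X_ext])             # stack feature matrices
    y_aug = np.concatenate([y_train, y_ext])        # stack labels
    return X_aug, y_aug

# e.g., DiCOVA2 training data augmented with Ling-GAN-anonymized CSS data:
# X_aug, y_aug = augment(X_dicova_train, y_dicova_train, X_css_linggan, y_css_linggan)
```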

IV-C Training and Inference Strategies

IV-C1 Training

For the systems that rely on hand-crafted features, training data normalization was achieved by removing the mean and scaling to unit variance. The fitted scaler was then applied to the validation and test data. Hyper-parameters were tuned on the held-out validation set. The optimal SVM regularization parameter was searched between 1e-5 and 1; the SVM kernel was set to linear; and the number of principal components was varied from 100 to 300. To train the BiLSTM classifier, in turn, recordings were first zero-padded to a 10-second length to ensure a fixed shape for the logmelspec input; the spectrogram was then mean-variance normalized. Each mini-batch was composed of 64 randomly shuffled samples, forced to contain both COVID-positive and COVID-negative samples. Unlike [12], no oversampling of the minority class or any other data augmentation techniques were used, as their effect on anonymization has yet to be quantified. The following hyper-parameters were used for training: Binary Cross Entropy (BCE) loss; Adam optimizer with an initial learning rate of 1e-4 and l2 regularization set to 1e-4. During the validation phase, an initial patience factor was set to 5 and reduced by 1 if the validation score did not increase. Training stopped whenever the patience factor reached 0, and the number of training epochs was saved for the inference phase.
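A sketch of the patience-based early-stopping loop for the BiLSTM training is given below; the model, data loader, and validation scoring function are assumed to exist, and resetting the patience factor upon improvement is our interpretation of the procedure described above.

```python
# Patience-based early stopping for the BiLSTM classifier (sketch).
# Assumes `model`, `train_loader`, and `validation_auc(model)` are defined elsewhere.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)  # l2 regularization
criterion = torch.nn.BCEWithLogitsLoss()    # BCE loss applied to the model logits

patience, best_auc, n_epochs = 5, 0.0, 0
while patience > 0:
    model.train()
    for x, y in train_loader:               # mini-batches of 64, shuffled
        optimizer.zero_grad()
        loss = criterion(model(x).squeeze(-1), y.float())
        loss.backward()
        optimizer.step()
    n_epochs += 1
    auc = validation_auc(model)             # AUC-ROC on the held-out validation set
    if auc > best_auc:
        best_auc, patience = auc, 5         # reset patience on improvement (assumption)
    else:
        patience -= 1                       # reduce patience if the score does not increase
# n_epochs is saved and reused when retraining on the pooled train+validation data
```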

IV-C2 Inference

For the first four systems, the trained model with the highest validation score was then used for testing. As the BiLSTM classifier is more data-hungry, the optimal hyper-parameters found in the training phase were used to train the classifier from scratch on the aggregated training and validation data. The number of training epochs was kept the same as that saved in the training phase.

IV-D Evaluation Metrics

Since all three datasets are imbalanced, the area under the receiver-operating-characteristic curve (AUC-ROC) was chosen as the primary metric to measure the diagnostics accuracy. We further calculated the 95% confidence intervals (CIs) using 1000× bootstrap with replacement on the test set. According to [75], CIs can reflect the variability of diagnostics accuracy when the model is applied to a different population.
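A sketch of the bootstrap procedure for the AUC-ROC confidence intervals is shown below; skipping single-class resamples is an implementation assumption.

```python
# 95% confidence interval for AUC-ROC via 1000x bootstrap with replacement.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    aucs, n = [], len(y_true)
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)                  # resample the test set with replacement
        if len(np.unique(y_true[idx])) < 2:          # skip resamples with a single class
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.mean(aucs)), (float(lo), float(hi))
```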

As mentioned previously, cosine similarity and misclassification rate were used to quantify the effectiveness of the three anonymization methods. The similarity scores are averaged across samples from all three datasets per method, where 0 represents no resemblance and 1 represents a perfect match between two tested speech conditions. While the Equal Error Rate (EER) is commonly used to evaluate anonymization efficacy, ground-truth speaker identifiers are required for each recording in order to verify if samples are from the same or different speakers. However, speaker identifiers were not available for the CSS and DiCOVA2 datasets. Instead, we rely on the misclassification rate of the pre-trained speaker verification model from [62]. For each recording, the model outputs a binary decision (yes/no) on whether a pair of anonymized and clean speech signals comes from the same speaker. The misclassification rate is then calculated by dividing the number of misclassified pairs by the total number of pairs per method, which reflects the percentage of successfully anonymized recordings. For an ideal anonymization system, the misclassification rate is expected to be 100%, i.e., the model should decide that no anonymized signal comes from the same speaker as its clean counterpart. Lastly, computation time was recorded for each anonymization method, including the loading and exporting of the audio files. In the case of the two GAN methods, the loading time of the model itself was not taken into account in this computation.

V Experimental Results and Discussion

V-A Task-1: Anonymization Results

The average cosine similarity scores between the speaker embeddings of speech files anonymized by the different methods, together with the misclassification rates, are shown in Figure 6. As can be seen, near-perfect anonymization performance was achieved with both GAN-based methods (see the misclassification rates), with almost no similarity to either the original speech or the speech anonymized by the other methods. On the other hand, nearly half of the McAdams-anonymized samples could still be correctly linked to the original speaker, suggesting that some speaker-unique information remained.

The computational complexity of the three anonymization methods is presented in Table III for all three datasets. While the GAN-based methods provide better anonymization effectiveness, they require computation times approximately 10-20 times longer than the McAdams coefficient method. The longest time was seen with Ling-Pros-GAN, since it requires extra time to extract prosody, which involves an online training loop, and to generate and select embeddings in real time. As model loading time was not taken into account, the computation footprint of the GAN-based methods could be larger in real-world settings. Additionally, the GAN-based methods rely on several pre-trained neural networks with millions of parameters (e.g., 22.3 million for the ECAPA-TDNN embedding extractor; 10 million for the generator), which could make them challenging to deploy on mobile devices.

Refer to caption
Figure 6: Cosine similarity between speech signals under different anonymization conditions averaged across three datasets. Values in the parentheses are the corresponding misclassification rates.
TABLE III: Average computation time per speech file (second) with standard deviations using different anonymization methods for the three datasets.
Method CSS DiC Cam Ave
McAdams Coef 0.87±0.10 1.15±0.13 0.88±0.92 0.97
Ling-GAN 8.52±2.93 10.22±3.56 9.58±2.70 9.44
Ling-Pros-GAN 26.49±20.53 24.9±11.61 19.47±11.61 23.62

V-B Task-2: Within-dataset Performance

The within-dataset performance of the five diagnostics systems under different anonymization scenarios is shown in Figure 7. As can be seen from the average AUC-ROC scores per scenario, the highest performance is achieved under scenario A, i.e., when anonymization is not performed. When the test data are anonymized using the McAdams coefficient (scenario B1), the average AUC-ROC score over all systems dropped by 8.9% (CSS), 5.9% (DiCOVA2), and 6.3% (Cambridge) relative to scenario A. A substantial decrease was observed when using the Ling-GAN and Ling-Pros-GAN anonymizers (scenarios B2 and B3), where average relative drops of 22.5% and 18.1% were observed, respectively. Moreover, nearly all systems degraded to chance levels under scenario C, where models were trained with data anonymized by one method and tested with data anonymized by another, suggesting that anonymization may drastically remove COVID-19 speech information. Diagnostic performance in the fully-informed scenarios is shown to be close to scenario A. Among the three anonymizers, McAdams anonymization leads to higher diagnostic performance on average in scenario D. Compared to Ling-GAN, Ling-Pros-GAN shows higher performance on the English datasets (DiCOVA2 and Cambridge) and lower performance on the multilingual one (CSS).

Next, we evaluate the sensitivity of the different diagnostics systems to anonymization and explore the relative drop in accuracy from scenario A to scenario B. Table IV reports the average drops seen per dataset. As can be seen, the two GAN-based methods resulted in substantially higher degradation relative to the McAdams coefficient method, with Ling-GAN leading to the most severe decrease. This was expected and corroborates the Task-1 results, where speaker embeddings of the GAN-anonymized speech showed practically no similarity to those of the original speech. Meanwhile, since Ling-Pros-GAN leaves the prosody intact and generates more COVID-like embeddings, it is likely to preserve more COVID-19 attributes than Ling-GAN, thus yielding higher diagnostic performance on anonymized data. Previous studies have shown that speaker embeddings (e.g., x-vector) also contain other nonverbal information and can be used for speech para-linguistic tasks [76, 77], such as speech emotion recognition [78] and disease detection [79, 80]. While the GAN-based anonymizers substitute the original speaker embedding with a dissimilar one, the obtained results suggest that health-related vocal characteristics are likely also discarded, thus resulting in significant drops in diagnostics accuracy.

TABLE IV: Drop in within-dataset AUC-ROC (%) from scenario A to scenario B for different anonymization methods.
Anonymization method CSS DiC Cam Ave
McAdams coefficient 8.9 5.9 6.3 7.0
Ling-GAN 27.3 30.5 9.8 22.5
Ling-Pros-GAN 25.2 20.4 8.7 18.1

Lastly, we use scenario A as the baseline and calculate the average drop in accuracy for scenario C, showing the impact that training models completely on anonymized data would have. For both the openSMILE and MSR methods, we use the PCA-SVM pipeline to avoid effects due to the difference in the number of features. The comparative results are reported in Table V. As can be seen, all three diagnostic systems show degraded performance, with the logmelspec+BiLSTM system being on average more robust (21.6%) to the semi-informed anonymization scenario. Notwithstanding, it should be highlighted that the logmelspec+BiLSTM system achieved the lowest AUC-ROC in scenario A. Interestingly, with the CSS dataset, the diagnostic system based on a BiLSTM and log-mel spectrogram input resulted in a substantially lower degradation percentage compared to the two other systems based on traditional engineered features and classifiers. CSS is a multilingual dataset, thus the hand-crafted features (e.g., syllabic rate, speech production features) used in these models may be more sensitive to language.

Refer to caption
Figure 7: Within-dataset performance under different anonymization scenarios. Error bars represent the 95% CIs. The line plot values correspond to the average AUC-ROC scores over the five diagnostic systems calculated per scenario.
TABLE V: Drop in within-dataset AUC-ROC (%) from scenario A to the average of all sub-conditions under scenario C for different diagnostics systems.
Diagnostics system CSS DiC Cam Ave
openSMILE+PCA-SVM 31.7 26.8 6.7 21.7
MSR+PCA-SVM 27.7 32.4 16.7 25.6
LogMelSpec+BiLSTM 18.8 34.0 12.1 21.6

V-C Task-2: Cross-dataset Performance

Figure 8 shows the cross-dataset performance under the thirteen different testing scenarios. In line with previous studies [44, 81], all five diagnostics systems demonstrated significantly lower performance relative to within-dataset results; the logmelspec+BiLSTM system showed the greatest drop in performance. Interestingly, in a few scenarios anonymization helped systems become more generalizable relative to the unprotected setting (e.g., scenarios B2 and C3 for the CSS-DiCOVA2 cross-database experiment). Figure 9 depicts the average change in accuracy relative to scenario A for all scenarios and diagnostic systems. While on average a 6.6% drop in accuracy was seen across all five systems, increases of 2% and 5% were achieved with the MSR+SVM and logmelspec+BiLSTM systems for scenarios C4 and C2, respectively. It is important to note that both scenarios involved GAN-based anonymized test data, thus typically had lower cross-dataset results to start with.

Refer to caption
Figure 8: Cross-dataset performance under different anonymization scenarios. Error bars represent the 95% CIs. The line plot values correspond to the average AUC-ROC scores over the five diagnostic systems calculated per scenario.
Refer to caption
Figure 9: Relative changes in the AUC-ROC under different anonymization scenarios for all diagnostics systems in the cross-dataset experiment.

V-D Explaining the degradation caused by different anonymizers

While our study shows that typical anonymization systems lead to degraded diagnostic performance, it is unclear why different systems caused different levels of degradation and why some diagnostic models could still perform decently after anonymization. To answer these questions, we performed a comprehensive evaluation of the impact of different speech aspects on diagnostic performance, including the linguistic content, speaker representation, and prosody. Similar to the experimental setup of Task-1, we now compare the within-dataset performance obtained by three categories of speech features, namely (1) the phoneme-level features, including the number of mispronunciations (as opposed to the speech script), number of pauses, and number of phonemes uttered per second; (2) the speaker representation extracted by concatenating the pre-trained x-vector and ECAPA-TDNN embeddings [51]; and (3) prosodic features, such as the low-level descriptors of the F0 contour.

A Linear Discriminant Analysis (LDA) classifier is applied on top of each of the feature sets for classification. The results achieved by these features are reported in Table VI. Among the three feature sets, speaker embeddings appear to be the most crucial for all datasets, corroborating the earlier results where the GAN-based anonymizers, which entirely substitute the original speaker embeddings, caused the most severe degradation. This finding also suggests that speaker-unique attributes and health-related information are highly entangled in the speaker embeddings. Considering that existing anonymization systems rely heavily on these off-the-shelf speaker embeddings, it remains challenging to preserve the health information while altering only the speaker identifier.
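A minimal sketch of this per-feature-set evaluation with an LDA classifier is shown below; the feature matrices and the train/test split variables are illustrative.

```python
# Per-feature-set evaluation with a Linear Discriminant Analysis classifier
# (cf. Table VI); ling_*, spk_*, pros_*, y_tr, y_te are assumed to be prepared.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

feature_sets = {"Linguistic": (ling_tr, ling_te),
                "Speaker": (spk_tr, spk_te),
                "Prosodic": (pros_tr, pros_te)}

for name, (X_tr, X_te) in feature_sets.items():
    lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, lda.decision_function(X_te))
    print(f"{name}: AUC-ROC = {auc:.3f}")
```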

While a group of studies reported prosody as a key biomarker to characterize speech disorders, such as dysarthria [82, 83, 84], our results show that phoneme-level linguistic features outperform prosodic features for COVID-19 detection. Specifically, we found the number of pauses and number of mispronunciations to be the most important phoneme-level features, with COVID-positive samples demonstrating more mispronunciations and fewer pauses. While the correlation between phoneme-level features and COVID-19 status has not been systematically studied, similar features have been examined for other diseases affecting speech production. For example, [85] shows that individuals with Parkinson’s disease produced fewer pauses at syntactic boundaries; the statistics of pauses have been shown crucial for diagnosing neuromuscular disorders, such as dysarthria [86]. Since GAN-based systems left linguistic content intact during anonymization, these findings help explain why the diagnostic models could perform above chance-level even when only the phoneme sequences were preserved during anonymization.

TABLE VI: Diagnostic performance achieved by different categories of speech features
Feature AUC-ROC
CSS DiCOVA2 Cambridge
Linguistic .561 .632 .555
Speaker .739 .697 .571
Prosodic .541 .564 .520

V-E Visualizing speech processed by different anonymizers

To better understand the impact of different anonymization methods on speech characteristics, we first visualize the waveform of the speech processed by the three anonymizers (see Fig. 10) for a direct comparison. As can be seen, those processed by the McAdams anonymizer and Ling-Pros-GAN share higher similarities in the waveform envelope shape with the original signal compared to the one generated by Ling-GAN. The difference seen in the plot is in line with the architecture design of different anonymizers. Among the three, Ling-GAN loses prosody and most of the speaker attributes, hence is expected to cause the highest amount of changes in the anonymized speech. The Ling-Pros-GAN and McAdams anonymizer, in turn, leave the speech rhythm untouched (i.e., duration and energy of phonemes), hence leading to higher resemblance in the waveform envelope.

Refer to caption
Figure 10: A comparison of the waveforms processed by the three anonymizers and the original speech.

Next, t-SNE plots are used to visualize the distribution of the speech features in two dimensions. Figure 11 shows the clusters of speech anonymized with different methods (computed from the training and validation data) and for the three feature modalities explored herein: openSMILE (subplots a), MSR (b), and logmelspec (c). As can be seen, for all three feature sets, the distribution of clean speech (blue) is closer to that of the McAdams anonymized speech (orange) and Ling-Pros-GAN anonymized speech (red), while the Ling-GAN anonymized speech (green) shows the least similarity with the other two, corroborating findings from Tasks 1 and 2.

Moreover, it can be seen from Figure 11a and Figure 11b that the clusters computed from the openSMILE and MSR features show little overlap, while the clusters of the logmelspec features show great overlap (Figure 11c). Together with the Task-2 results, this shift in the feature space is likely the main cause of the larger decrease observed in the openSMILE and MSR systems under different anonymization settings. Meanwhile, since all anonymization methods keep the speech content intact and change only the nonverbal attributes, a greater shift in feature space may indicate a stronger correlation with the para-linguistic aspects and less with the linguistic aspects. This echoes previous studies which showed that openSMILE and MSR features are preferred over logmelspec features in characterizing emotional and unnatural speech [87, 88].

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 11: t-SNE clusters of anonymized speech features for different feature sets, namely: (a) openSMILE, (b) MSR, and (c) logmelspec. Blue dots corresponds to original speech; orange to McAdams coefficient anonymized speech; red to Ling-Pros-GAN anonymized speech; and green to Ling-GAN anonymized speech.

V-F Task-3: Improving Diagnostics Performance with Data Augmentation

Lastly, we investigate the use of anonymized external data for data augmentation and its impact on the performance achieved in scenarios B and C. With scenario C, we chose sub-conditions C1 and C3. To quantify the relative improvement, we used the within-dataset performance achieved in scenario A as the baseline and calculated the performance increase (in percentage). The relative changes observed with the three diagnostics systems are reported in Table VII. Here, we explore augmentation with two different datasets and with four different methods: original, McAdams, Ling-GAN, and Ling-Pros-GAN. As can be seen, when test data are anonymized using the McAdams coefficient (B1), the highest improvement is generally achieved when the diagnostic system is augmented with the original data. In turn, when the test data are anonymized using the GAN-based method (B2), augmenting the set with GAN-anonymized data from another dataset leads to a higher increase. Similar results are seen in scenarios C1 and C3, where clean and GAN-anonymized augmentation data result in more significant improvements. While not the top performer, the McAdams method is shown to be a reliable augmentation strategy, especially for the openSMILE features. Overall, these findings suggest that anonymization has the potential to be used as a data augmentation approach to improve COVID-19 diagnostics accuracy when tested on anonymized data.

TABLE VII: Change of AUC-ROC scores achieved in scenarios B and C after data augmentation (given in %). Bold values indicate the highest improvement with each diagnostics system under a given scenario.
Scen. Aug. data Aug. version openSMILE MSR logmelspec
B1 CSS Clean 0.6 9.9 -30.6
McAdams -5.0 2.7 -23.7
Ling-GAN -2.4 5.2 -7.9
Ling-Pros-GAN -5.0 -9.5 -8.9
Cam Clean -1.5 -6.4 -9.7
McAdams -12.1 -23.7 0.1
Ling-GAN -14.1 -2.2 5.7
Ling-Pros-GAN -1.2 -10.7 -1.1
B2 CSS Clean 21.3 3.9 -7.9
McAdams 3.8 15.0 -15.7
Ling-GAN 24.5 25.5 -1.6
Ling-Pros-GAN 4.1 4.9 -5.3
Cam Clean 21.5 31.0 -7.9
McAdams 4.8 16.6 1.8
Ling-GAN 16.1 23.2 18.6
Ling-Pros-GAN -5.4 4.6 3.9
C1 CSS Clean 14.4 -8.8 11.7
McAdams -2.2 -6.7 -6.6
Ling-GAN 16.6 -0.4 8.2
Ling-Pros-GAN 0.0 9.5 3.8
Cam Clean 18.7 5.0 -0.8
McAdams 1.9 -30.5 -9.6
Ling-GAN 7.4 -10.1 2.9
Ling-Pros-GAN -21.0 -14.0 1.3
C3 CSS Clean 5.9 11.1 -28.3
McAdams 5.3 -13.9 -10.8
Ling-GAN 2.0 26.2 -5.4
Ling-Pros-GAN 14.5 -5.4 -3.7
Cam Clean 15.7 -8.7 -8.5
McAdams 15.5 -0.4 -16.6
Ling-GAN 1.5 -20.7 5.6
Ling-Pros-GAN 22.0 -4.7 2.1

V-G Limitations, Biases, and Future Work

The study’s principal aim was to validate the effectiveness of anonymization methods within and across datasets in the context of assessing voice-based COVID-19 diagnostic accuracy. While we investigated three anonymization methods, others are emerging continuously (e.g., [89, 90]); thus, the findings reported herein should be validated with more recent methods. In the present study, the ASR-based anonymization method developed on English speech was applied to the multilingual CSS dataset. The finding that GAN-based anonymization had the lowest cross-dataset performance may suggest challenges in applying this method to multilingual datasets and non-English speaking populations. In the future, multilingual GANs should be explored to avoid unfair outcomes [91] due to certain languages or cultural settings being excluded from the training and testing datasets. Moreover, while the injection of anonymized external data was shown to be a useful data augmentation strategy, the final results were still at times lower than those achieved in the classical “unprotected” setting. This suggests that health-related information is being discarded during the anonymization process; thus, future work could explore the development of diagnostic-aware anonymization methods that keep such discriminatory information intact.

Beyond tackling the limitations mentioned above, future work into voice-based diagnostics should be mindful of potential biases during data collection that could lead to confounds for both the anonymization and diagnostic steps. These confounders, if not properly dealt with, can reinforce the systemic nature of biases, for instance in relation to gender and racioethnic groups, that already exist within the healthcare system, thus transferring them to automated diagnostic systems. While [35] already showed some impact of sampling rate on diagnostic accuracy, several other potential biases may exist at the methodological level. For example, sociodemographic biases may emerge if age is not taken into consideration, as cognitive limitations (e.g., difficulty in speech planning or lexical access) associated with aging could alter speech patterns and affect overall diagnostic accuracy. Recent work has shown that socioeconomic status could serve as a bias in COVID-19 detection [92]. For example, as data were collected from participants at home, recordings from those living in crowded conditions could have had increased background noise levels that negatively affected anonymization and diagnostic efficacy. Moreover, disadvantaged populations have been shown to have more chronic respiratory diseases [93] and higher levels of mood disorders and psychological distress [94]. As such, anonymization processes affecting para-linguistic features associated with depressed mood may disproportionately affect those with low socioeconomic status. Addressing biases in automated voice anonymization and diagnosis systems is beyond the scope of this paper and is left for future studies.

VI Conclusion

In this study, we comprehensively evaluated three voice anonymization methods, measuring both their anonymization efficacy and their impact on the accuracy of five leading COVID-19 detection systems. All anonymization methods were shown to degrade diagnostic accuracy, with the most severe degradation seen for the systems that directly altered speaker embeddings. Our findings suggest that existing methods lack the capability to effectively preserve diagnostic information while obfuscating speaker identity. Lastly, we explored the use of anonymized external data as a data augmentation tool and obtained promising results.

Acknowledgments

The authors would like to thank the developers of the CSS, DiCOVA2, and Cambridge datasets for making them available to the community for research purposes. The developers of the datasets do not bear any responsibility for the analysis and results presented in this paper; all results and interpretations represent only the views of the authors. The authors acknowledge funding from INRS and NSERC.

References

  • [1] M. B. Hoy, “Alexa, Siri, Cortana, and more: An introduction to voice assistants,” Medical Reference Services Quarterly, vol. 37, no. 1, pp. 81–88, 2018.
  • [2] R. Togneri and D. Pullella, “An overview of speaker identification: Accuracy and robustness issues,” IEEE Circuits and Systems Magazine, vol. 11, no. 2, pp. 23–61, 2011.
  • [3] K. K. Lella and A. PJA, “A literature review on COVID-19 disease diagnosis from respiratory sound data,” arXiv preprint arXiv:2112.07670, 2021.
  • [4] V. Nathan, K. Vatanparvar, M. M. Rahman, E. Nemati, and J. Kuang, “Assessment of chronic pulmonary disease patients using biomarkers from natural speech recorded by mobile devices,” in 2019 IEEE 16th International Conference on Wearable and Implantable Body Sensor Networks (BSN).   IEEE, 2019, pp. 1–4.
  • [5] U. Petti, S. Baker, and A. Korhonen, “A systematic literature review of automatic Alzheimer’s disease detection from speech and language,” Journal of the American Medical Informatics Association, vol. 27, no. 11, pp. 1784–1797, 2020.
  • [6] P. F. Macneilage, “Speech production,” Language and Speech, vol. 23, no. 1, pp. 3–23, 1980.
  • [7] P. Vetter, D. Vu, A. L’Huillier, M. Schibler, L. Kaiser, and F. Jacquerioz, “Clinical features of COVID-19,” 2020.
  • [8] T. Quatieri, T. Talkar, and J. Palmer, “A framework for biomarkers of COVID-19 based on coordination of speech-production subsystems,” IEEE Open J. Engineering in Medicine and Biology, vol. 1, pp. 203–206, 2020.
  • [9] V. K. Paliwal, R. K. Garg, A. Gupta, and N. Tejan, “Neuromuscular presentations in patients with COVID-19,” Neurological Sciences, vol. 41, no. 11, pp. 3039–3056, 2020.
  • [10] Y. Zhu and T. H. Falk, “Fusion of modulation spectral and spectral features with symptom metadata for improved speech-based COVID-19 detection,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 8997–9001.
  • [11] B. W. Schuller, A. Batliner, C. Bergler, C. Mascolo, J. Han, I. Lefter, H. Kaya, S. Amiriparian, A. Baird, L. Stappen et al., “The INTERSPEECH 2021 computational paralinguistics challenge: COVID-19 cough, COVID-19 speech, escalation & primates,” arXiv preprint arXiv:2102.13468, 2021.
  • [12] N. K. Sharma, S. R. Chetupalli, D. Bhattacharya, D. Dutta, P. Mote, and S. Ganapathy, “The second DiCOVA challenge: Dataset and performance analysis for diagnosis of COVID-19 using acoustics,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 556–560.
  • [13] J. Wang, B. Cao, P. Yu, L. Sun, W. Bao, and X. Zhu, “Deep learning towards mobile applications,” in 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS).   IEEE, 2018, pp. 1385–1393.
  • [14] C. Stupp, “Fraudsters used AI to mimic CEO’s voice in unusual cybercrime case,” The Wall Street Journal, vol. 30, no. 08, 2019.
  • [15] N. Kaloudi and J. Li, “The AI-based cyber threat landscape: A survey,” ACM Computing Surveys (CSUR), vol. 53, no. 1, pp. 1–34, 2020.
  • [16] M. M. Yamin, M. Ullah, H. Ullah, and B. Katt, “Weaponized AI for cyber attacks,” Journal of Information Security and Applications, vol. 57, p. 102722, 2021.
  • [17] U. Iqbal, P. N. Bahrami, R. Trimananda, H. Cui, A. Gamero-Garrido, D. Dubois, D. Choffnes, A. Markopoulou, F. Roesner, and Z. Shafiq, “Your echos are heard: Tracking, profiling, and ad targeting in the Amazon smart speaker ecosystem,” arXiv preprint arXiv:2204.10920, 2022.
  • [18] H. Jin and S. Wang, “Voice-based determination of physical and emotional characteristics of users,” U.S. Patent 10,096,319, Oct. 9, 2018.
  • [19] S. Latif, J. Qadir, A. Qayyum, M. Usama, and S. Younis, “Speech technology for healthcare: Opportunities, challenges, and state of the art,” IEEE Reviews in Biomedical Engineering, vol. 14, pp. 342–356, 2020.
  • [20] B. T. Harel, M. S. Cannizzaro, H. Cohen, N. Reilly, and P. J. Snyder, “Acoustic characteristics of parkinsonian speech: a potential biomarker of early disease progression and treatment,” Journal of Neurolinguistics, vol. 17, no. 6, pp. 439–453, 2004.
  • [21] D. M. Low, K. H. Bentley, and S. S. Ghosh, “Automated assessment of psychiatric disorders using speech: A systematic review,” Laryngoscope investigative otolaryngology, vol. 5, no. 1, pp. 96–116, 2020.
  • [22] T. Z. Zarsky, “Incompatible: The GDPR in the age of big data,” Seton Hall L. Rev., vol. 47, p. 995, 2016.
  • [23] I. Calzada, “Citizens’ data privacy in China: The state of the art of the Personal Information Protection Law (PIPL),” Smart Cities, vol. 5, no. 3, pp. 1129–1150, 2022.
  • [24] N. Tomashenko, X. Wang, E. Vincent, J. Patino, B. M. L. Srivastava, P.-G. Noé, A. Nautsch, N. Evans, J. Yamagishi, B. O’Brien et al., “The VoicePrivacy 2020 challenge: Results and findings,” Computer Speech & Language, vol. 74, p. 101362, 2022.
  • [25] N. Tomashenko, X. Wang, X. Miao, H. Nourtel, P. Champion, M. Todisco, E. Vincent, N. Evans, J. Yamagishi, and J. F. Bonastre, “The VoicePrivacy 2022 challenge evaluation plan,” arXiv preprint arXiv:2203.12468, 2022.
  • [26] F. Fang, X. Wang, J. Yamagishi, I. Echizen, M. Todisco, N. Evans, and J.-F. Bonastre, “Speaker anonymization using x-vector and neural waveform models,” arXiv preprint arXiv:1905.13561, 2019.
  • [27] S. Meyer, F. Lux, P. Denisov, J. Koch, P. Tilli, and N. T. Vu, “Speaker anonymization with phonetic intermediate representations,” arXiv preprint arXiv:2207.04834, 2022.
  • [28] H. Nourtel, P. Champion, D. Jouvet, A. Larcher, and M. Tahon, “Evaluation of speaker anonymization on emotional speech,” in SPSC 2021-1st ISCA Symposium on Security and Privacy in Speech Communication, 2021.
  • [29] S. H. Dumpala, R. Uher, S. Matwin, M. Kiefte, and S. Oore, “Sine-wave speech and privacy-preserving depression detection,” in Proc. SMM21, Workshop on Speech, Music and Mind, vol. 2021, 2021, pp. 11–15.
  • [30] C. H. Lee and H.-J. Yoon, “Medical big data: promise and challenges,” Kidney research and clinical practice, vol. 36, no. 1, p. 3, 2017.
  • [31] F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in Proc. ACM international conference on Multimedia, 2010, pp. 1459–1462.
  • [32] T. Warnita, N. Inoue, and K. Shinoda, “Detecting Alzheimer’s disease using gated convolutional neural network from audio data,” arXiv preprint arXiv:1803.11344, 2018.
  • [33] V. S. Nallanthighal, A. Härmä, and H. Strik, “Detection of copd exacerbation from speech: comparison of acoustic features and deep learning based speech breathing models,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 9097–9101.
  • [34] J. Han, K. Qian, M. Song et al., “An early study on intelligent analysis of speech under COVID-19: Severity, sleep quality, fatigue, and anxiety,” arXiv:2005.00096, 2020.
  • [35] J. Han, T. Xia, D. Spathis, E. Bondareva, C. Brown, J. Chauhan, T. Dang, A. Grammenos, A. Hasthanasombat, A. Floto et al., “Sounds of COVID-19: exploring realistic performance of audio-based digital testing,” NPJ digital medicine, vol. 5, no. 1, pp. 1–9, 2022.
  • [36] H. Coppock, A. Akman, C. Bergler, M. Gerczuk, C. Brown, J. Chauhan, A. Grammenos, A. Hasthanasombat, D. Spathis, T. Xia et al., “A summary of the compare COVID-19 challenges,” arXiv preprint arXiv:2202.08981, 2022.
  • [37] T. H. Falk and W.-Y. Chan, “Modulation spectral features for robust far-field speaker identification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, pp. 90–100, 2009.
  • [38] A. R. Avila, Z. Akhtar, J. F. Santos, D. O’Shaughnessy, and T. H. Falk, “Feature pooling of modulation spectrum features for improved speech emotion recognition in the wild,” IEEE Transactions on Affective Computing, vol. 12, no. 1, pp. 177–188, 2018.
  • [39] A. Tiwari, R. Cassani, S. Kshirsagar, D. P. Tobon, Y. Zhu, and T. H. Falk, “Modulation spectral signal representation for quality measurement and enhancement of wearable device data: A technical note,” Sensors, vol. 22, no. 12, p. 4579, 2022.
  • [40] T. H. Falk, W.-Y. Chan, E. Sejdic, and T. Chau, “Spectro-temporal analysis of auscultatory sounds,” New Developments in Biomedical Engineering, pp. 93–104, 2010.
  • [41] A. Akman, H. Coppock, A. Gaskell, P. Tzirakis, L. Jones, and B. W. Schuller, “Evaluating the COVID-19 identification ResNet (CIdeR) on the INTERSPEECH COVID-19 from audio challenges,” arXiv preprint arXiv:2107.14549, 2021.
  • [42] G. Deshpande and B. W. Schuller, “Audio, speech, language, & signal processing for COVID-19: A comprehensive overview,” arXiv preprint arXiv:2011.14445, 2020.
  • [43] B. W. Schuller, D. M. Schuller, K. Qian, J. Liu, H. Zheng, and X. Li, “COVID-19 and computer audition: An overview on what speech & sound analysis could contribute in the SARS-CoV-2 corona crisis,” Frontiers in Digital Health, vol. 3, p. 564906, 2021.
  • [44] Y. Zhu, A. Mariakakis, E. De Lara, and T. H. Falk, “How generalizable and interpretable are speech-based COVID-19 detection systems?: A comparative analysis and new system proposal,” in 2022 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI).   IEEE, 2022, pp. 1–5.
  • [45] Y. Stylianou, “Voice transformation: a survey,” in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2009, pp. 3585–3588.
  • [46] B. M. L. Srivastava, N. Vauquier, M. Sahidullah, A. Bellet, M. Tommasi, and E. Vincent, “Evaluating voice conversion-based privacy protection against informed attackers,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 2802–2806.
  • [47] S. H. Mohammadi and A. Kain, “An overview of voice conversion systems,” Speech Communication, vol. 88, pp. 65–82, 2017.
  • [48] S. E. McAdams, Spectral fusion, spectral parsing and the formation of auditory images.   Stanford University, 1984.
  • [49] J. Qian, H. Du, J. Hou, L. Chen, T. Jung, X.-Y. Li, Y. Wang, and Y. Deng, “VoiceMask: Anonymize and sanitize voice input on mobile devices,” arXiv preprint arXiv:1711.11460, 2017.
  • [50] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2018, pp. 5329–5333.
  • [51] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” arXiv preprint arXiv:2005.07143, 2020.
  • [52] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17 022–17 033, 2020.
  • [53] B. M. L. Srivastava, N. Tomashenko, X. Wang, E. Vincent, J. Yamagishi, M. Maouche, A. Bellet, and M. Tommasi, “Design choices for x-vector based speaker anonymization,” arXiv preprint arXiv:2005.08601, 2020.
  • [54] S. Meyer, F. Lux, P. Denisov, J. Koch, P. Tilli, and N. T. Vu, “Speaker Anonymization with Phonetic Intermediate Representations,” in Proc. Interspeech 2022, 2022, pp. 4925–4929.
  • [55] T. Xia, D. Spathis, J. Ch, A. Grammenos, J. Han, A. Hasthanasombat, E. Bondareva, T. Dang, A. Floto, P. Cicuta et al., “COVID-19 sounds: a large-scale audio dataset for digital respiratory screening,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
  • [56] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987.
  • [57] J. F. Santos, M. Senoussaoui, and T. H. Falk, “An improved non-intrusive intelligibility metric for noisy and reverberant speech,” in 2014 14th International Workshop on Acoustic Signal Enhancement (IWAENC).   IEEE, 2014, pp. 55–59.
  • [58] J. Patino, N. Tomashenko, M. Todisco, A. Nautsch, and N. Evans, “Speaker anonymisation using the McAdams coefficient,” in Interspeech 2021.   ISCA, 2021, pp. 1099–1103.
  • [59] D. O’Shaughnessy, “Linear predictive coding,” IEEE potentials, vol. 7, no. 1, pp. 29–32, 1988.
  • [60] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
  • [61] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
  • [62] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong et al., “SpeechBrain: A general-purpose speech toolkit,” arXiv preprint arXiv:2106.04624, 2021.
  • [63] H. Liu, X. Gu, and D. Samaras, “Wasserstein GAN with quadratic transport cost,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4832–4841.
  • [64] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020.
  • [65] S. Meyer, F. Lux, J. Koch, P. Denisov, P. Tilli, and N. T. Vu, “Prosody is not identity: A speaker anonymization approach using prosody cloning,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.
  • [66] Y. Wang, D. Stanton, Y. Zhang, R.-S. Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” in International conference on machine learning.   PMLR, 2018, pp. 5180–5189.
  • [67] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
  • [68] L. Orlandic, T. Teijeiro, and D. Atienza, “The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms,” Scientific Data, vol. 8, no. 1, p. 156, 2021.
  • [69] D. T. Pizzo and S. Esteban, “IATos: AI-powered pre-screening tool for COVID-19 from cough audio samples,” arXiv preprint arXiv:2104.13247, 2021.
  • [70] G. Chaudhari, X. Jiang, A. Fakhry, A. Han, J. Xiao, S. Shen, and A. Khanzada, “Virufy: Global applicability of crowdsourced and clinical datasets for ai detection of COVID-19 from cough,” arXiv preprint arXiv:2011.13320, 2020.
  • [71] M. Cohen-McFarlane, R. Goubran, and F. Knoefel, “Novel coronavirus cough database: NoCoCoDa,” IEEE Access, vol. 8, pp. 154 087–154 094, 2020.
  • [72] M. Roberts, D. Driggs, M. Thorpe, J. Gilbey, M. Yeung, S. Ursprung, A. I. Aviles-Rivero, C. Etmann, C. McCague, L. Beer et al., “Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and ct scans,” Nature Machine Intelligence, vol. 3, no. 3, pp. 199–217, 2021.
  • [73] Z. Deng, L. Zhang, A. Ghorbani, and J. Zou, “Improving adversarial robustness via unlabeled out-of-domain data,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2021, pp. 2845–2853.
  • [74] S. Shahnawazuddin, W. Ahmad, N. Adiga, and A. Kumar, “In-domain and out-of-domain data augmentation to improve children’s speaker verification system in limited data scenario,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 7554–7558.
  • [75] R. W. Platt, J. A. Hanley, and H. Yang, “Bootstrap confidence intervals for the sensitivity of a quantitative diagnostic test,” Statistics in medicine, vol. 19, no. 3, pp. 313–322, 2000.
  • [76] D. Raj, D. Snyder, D. Povey, and S. Khudanpur, “Probing the information encoded in x-vectors,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).   IEEE, 2019, pp. 726–733.
  • [77] R. J. van Son et al., “Measuring voice quality parameters after speaker pseudonymization.” in Interspeech, 2021, pp. 1019–1023.
  • [78] R. Pappagari, T. Wang, J. Villalba, N. Chen, and N. Dehak, “x-vectors meet emotions: A study on dependencies between emotion and speaker recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 7169–7173.
  • [79] L. Moro-Velazquez, J. Villalba, and N. Dehak, “Using x-vectors to automatically detect Parkinson’s disease from speech,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 1155–1159.
  • [80] R. Pappagari, J. Cho, L. Moro-Velazquez, and N. Dehak, “Using state of the art speaker recognition and natural language processing technologies to detect Alzheimer’s disease and assess its severity,” in Interspeech, 2020, pp. 2177–2181.
  • [81] H. Coppock, A. Gaskell, P. Tzirakis, A. Baird, L. Jones, and B. Schuller, “End-to-end convolutional neural network enables COVID-19 detection from breath and cough audio: a pilot study,” BMJ innovations, vol. 7, no. 2, 2021.
  • [82] G. Vyas, M. K. Dutta, J. Prinosil, and P. Harár, “An automatic diagnosis and assessment of dysarthric speech using speech disorder specific prosodic features,” in 2016 39th International Conference on Telecommunications and Signal Processing (TSP).   IEEE, 2016, pp. 515–518.
  • [83] K. Kadi, S. Selouani, B. Boudraa, and M. Boudraa, “Discriminative prosodic features to assess the dysarthria severity levels,” in Proceedings of the World Congress on Engineering, vol. 3, 2013.
  • [84] V. M. Ramos, H. A. K. Hernandez-Diaz, M. E. H.-D. Huici, H. Martens, G. Van Nuffelen, and M. De Bodt, “Acoustic features to characterize sentence accent production in dysarthric speech,” Biomedical Signal Processing and Control, vol. 57, p. 101750, 2020.
  • [85] M. Darling-White and J. E. Huber, “The impact of parkinson’s disease on breath pauses and their relationship to speech impairment: A longitudinal study,” American Journal of Speech-Language Pathology, vol. 29, no. 4, pp. 1910–1922, 2020.
  • [86] G. Noffs, T. Perera, S. C. Kolbe, C. J. Shanahan, F. M. Boonstra, A. Evans, H. Butzkueven, A. van der Walt, and A. P. Vogel, “What speech can tell us: A systematic review of dysarthria characteristics in multiple sclerosis,” Autoimmunity reviews, vol. 17, no. 12, pp. 1202–1209, 2018.
  • [87] T. H. Falk, W.-Y. Chan, and F. Shein, “Characterization of atypical vocal source excitation, temporal dynamics and prosody for objective measurement of dysarthric word intelligibility,” Speech Communication, vol. 54, no. 5, pp. 622–631, 2012.
  • [88] F. Eyben, Real-time speech and music classification by large audio feature space extraction.   Springer, 2015.
  • [89] J. Deng, F. Teng, Y. Chen, X. Chen, Z. Wang, and W. Xu, “V-cloak: Intelligibility-, naturalness-& timbre-preserving real-time voice anonymization,” arXiv preprint arXiv:2210.15140, 2022.
  • [90] X. Miao, X. Wang, E. Cooper, J. Yamagishi, and N. Tomashenko, “Speaker anonymization using orthogonal householder neural network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [91] J. Morley, C. C. Machado, C. Burr, J. Cowls, I. Joshi, M. Taddeo, and L. Floridi, “The ethics of ai in health care: a mapping review,” Social Science & Medicine, vol. 260, p. 113172, 2020.
  • [92] Y. Zhu, M. Imoussaine, C. Côté-Lussier, and T. Falk, “Investigating biases in COVID-19 diagnostic systems processed with automated speech anonymization algorithms,” in Proc. 3rd Symposium on Security and Privacy in Speech Communication, 2023, pp. 46–54.
  • [93] R. A. Pleasants, I. L. Riley, and D. M. Mannino, “Defining and targeting health disparities in chronic obstructive pulmonary disease,” International journal of chronic obstructive pulmonary disease, pp. 2475–2496, 2016.
  • [94] A. Drapeau, A. Marchand, and D. Beaulieu-Prevost, “Epidemiology of psychological distress,” Mental illnesses-understanding, prediction and control, vol. 69, no. 2, pp. 105–106, 2012.