An overview of techniques for biomarker discovery in voice signal
Abstract
This paper reflects on the effect of several categories of medical conditions on human voice, focusing on those that may be hypothesized to have effects on voice, but for which the changes themselves may be subtle enough to have eluded observation in standard analytical examinations of the voice signal. It presents three categories of techniques that can potentially uncover such elusive biomarkers and allow them to be measured and used for predictive and diagnostic purposes. These approaches include proxy techniques, model-based analytical techniques and data-driven AI techniques.
Index Terms— Biomarker discovery, Voice profiling, Feature engineering, Voice analytics, AI-based diagnostic aids
1 Introduction
Based on their effects on human voice, medical conditions that are known to affect humans can be divided into four clear categories. Of these, one category of diseases comprises those that have absolutely no effect on voice, such as certain dermatological conditions, hair-related conditions etc. In contrast to this, the category of conditions that is expected to have the most obvious effects on voice includes diseases that directly affect the structures of the vocal tract – vocal folds, larynx, glottis, respiratory tract, articulators etc. Examples of such diseases are otolaryngological diseases of various etiology. A third category is that of diseases that indirectly affect the processes that drive voice production – including cognitive, neuromuscular, biomechanical and auditory feedback processes. These conditions cause varied effects on voice, ranging from fairly intense and obvious effects, to very subtle or almost imperceptible ones. This category includes diseases such as those listed in Table 1, and syndromes caused by secondary effects of drugs, intoxicants and other harmful substances. The fourth category – and the focus of the mechanisms presented in this paper – comprises diseases for which the existence of voice changes is hypothesized, but these may not be evident through standard analytical examinations of the voice signal. These include disease subcategories that affect intellectual abilities, visual function, temperament, personality, behavior etc. For these diseases, biomarkers in voice may be hypothesized to be present, but are elusive and must be searched for – in effect designed or created – for use in data-driven applications that attempt to detect the presence of these diseases from voice.
The changes in voice that are alluded to above are in fact biomarker patterns, or biomarkers. The term biomarker in this context refers to specific patterns of change in the voice signal that carry information about the health conditions that cause them. Such changes may be thought of as perturbations or deviations from a hypothetical “normal” voice signal, within the frequency, duration, intensity, amplitude and other entities that characterize it. The changes may be wide-ranging and coarse in nature, or temporally transient, occurring within micro-durations of the signal. In other words, biomarker patterns may range from being highly perceptible, to completely imperceptible – to the extent that they may be undetectable even within standard analytical representations of the speech signal.
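The coarser, perceptible end of this spectrum is routinely quantified with perturbation measures such as jitter and shimmer, which appear repeatedly in Table 1. The following is a minimal sketch of these local measures, assuming the cycle-level pitch periods and peak amplitudes have already been extracted from the signal:

```python
import numpy as np

def jitter_percent(periods):
    """Local jitter: mean absolute difference between consecutive
    pitch periods, relative to the mean period (percent)."""
    p = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(p))) / np.mean(p)

def shimmer_percent(amplitudes):
    """Local shimmer: mean absolute difference between consecutive
    cycle peak amplitudes, relative to the mean amplitude (percent)."""
    a = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(a))) / np.mean(a)

# A perfectly periodic "normal" voice has zero jitter and shimmer;
# perturbations from that hypothetical norm raise both.
steady = jitter_percent([0.010, 0.010, 0.010])      # 0.0
perturbed = jitter_percent([0.010, 0.011, 0.010])   # > 0
```

Measures of this kind capture only the coarse, observable deviations; the techniques in the remainder of this paper target the changes that such standard measures miss.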
The rest of this paper introduces three categories of techniques that can potentially uncover such elusive biomarkers. Such techniques can be thought of as biomarker discovery or feature engineering techniques focused on the design or re-design of biomarker features in voice. The approaches described below include proxy techniques, model-based analytical techniques and data-driven AI techniques.
Health condition | How it affects voice |
---|---|
Attention Deficit Hyperactivity Disorder (ADHD) | Prosodic variations in loudness and fundamental frequency [1] |
Amyotrophic Lateral Sclerosis (ALS) | Voice tremor, flutter [2], incomplete vocal fold closure, dysarthria [3]; Dystonia, dysarthria [4]; Low-frequency (< 4 Hz) tremor [5] |
Alzheimer’s Disease (Dementia) | Abnormal fundamental frequency, pause and voice-break patterns, reduction in vocal range [6, 7]; Dysphonia [8] |
Arthritis: Fibromyalgia | Changes in jitter, shimmer, harmonic-to-noise ratio, and phonation time [9] |
Cerebral Palsy | Dysphonia [10]; Breathiness, Asthenia, Roughness, Strain [11] |
Cholera | Husky voice [12]; high-pitched, asthenic voice [13]; “Cholera voice” [14] |
Congenital Heart Defects | Dystonia, dysarthria [4] |
Coronavirus Disease 2019 (COVID-19) | Abnormal acoustic measures of pitch, harmonics, jitter, shimmer [15]; Asymmetric/abnormal vocal fold vibrations [16, 17] |
Diabetes | Roughness, asthenia, breathiness, and strain in high glycemic index subgroups [18] |
Down Syndrome | Reduced pitch range, higher mean fundamental frequency, reduced jitter [19] |
Epilepsy (Temporal Lobe) | Abnormal voice pitch regulation [20] |
Huntington’s disease | Low-frequency (< 4 Hz) tremor [5] |
Hypertension | Noisy breathing, voice production difficulty, vocal fatigue [21] |
Hyperthermia (Extreme Heat) | Pressured speech [22]; Dysarthria [23] |
Hypothermia (Extreme Cold) | Slurred speech [24] |
Lung Cancer | Hoarseness, GRBAS symptoms, dysphonia [25] |
Myasthenia gravis | Dystonia, dysarthria [4] |
Multiple sclerosis | Low-frequency (< 4 Hz) tremor [5] |
Myxoedema | Low-pitched, husky, and nasal voice, unusual diction [26] |
Parkinson’s disease | Dystonia, dysarthria [4]; Low-frequency (< 4 Hz) tremor, medium-frequency (4–7 Hz) tremor [5] |
Pneumonia | Wet, gurgly voice [27] |
Pulmonary Hypertension | Hoarseness [28] |
Stress | Breathiness, softness [29], Multiple changes [30] |
Tobacco Use | Multiple voice quality changes, high jitter [31] |
2 Biomarkers and their measurement through proxy techniques
Proxy techniques may be used for changes in voice that are observable by humans, but not easily measurable. For example, many of the correlations established between various diseases and voice changes in the medical literature refer to changes in voice quality. These changes, although subjective, comprise biomarkers that could potentially be used in machine learning systems for prediction of the corresponding diseases from voice. The problem, however, is that for the most part the entities that constitute the set of voice qualities are subjectively specified. For many, methodologies for objective measurement do not exist.
For example, the voice sub-qualities of nasality (or nasalance), roughness, breathiness, asthenia etc. have no objective measures associated with them. They are instead rated by human experts on standardized clinical rating scales, such as the Voice Rating Scale (VRS) [32], the Voice Disability Coping Questionnaire (VDCQ) [33], the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) [34] etc. Examples of such subjective correlations of various diseases to voice changes are listed in Table 1.
2.1 Proxy features
Features that are subjectively rated and do not have specific methods to objectively measure them can be measured through proxy. The term “proxy” here refers to the act of using (or creating) replacement features that can be measured instead of the original ones. For this to be viable, the proxy features must be highly correlated to the subjective features that we desire to measure. There are two potential mechanisms to create such proxy features: the use of physical models of voice production to produce signals that can be more easily measured, and the use of AI mechanisms for transfer learning that can generate measurable features that exhibit the same patterns as the subjective features they proxy for.
2.1.1 Proxy features from models of voice production: measurement through emulation
As an example, we consider physical models that can generate just one specific aspect of a continuous speech signal – the set of phonated sounds embedded in it. The idea is to use such models to generate or approximate the actual motion of the vocal folds during the process of phonation, producing a glottal flow signal that has characteristics (or specific voice sub-qualities) that are similar to a given recorded speech signal. Once this is achieved, the parameters of the model used can proxy for the speech quality characteristics of the original signal, which may not be directly measurable. We can also call this process “measurement through emulation.” Examples of physical models of phonation that can be used include the 1-mass, mass-spring model [35], the 2-mass, mass-spring model [36] etc.
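To make this concrete, the following is a minimal, illustrative sketch (not the calibrated model of [35]) of a 1-mass-per-fold phonation model, with the left and right folds represented as coupled van der Pol oscillators; the parameter values here are arbitrary choices for demonstration, not physiologically fitted ones:

```python
import numpy as np
from scipy.integrate import solve_ivp

def coupled_vdp(t, y, mu, delta, coupling):
    """Two vocal folds as coupled van der Pol oscillators.
    y = [xl, vl, xr, vr]: displacement and velocity of the left and
    right fold masses; delta detunes the right fold to model
    left/right asymmetry."""
    xl, vl, xr, vr = y
    dvl = mu * (1 - xl**2) * vl - xl + coupling * (xr - xl)
    dvr = mu * (1 - xr**2) * vr - (1 + delta) * xr + coupling * (xl - xr)
    return [vl, dvl, vr, dvr]

# Illustrative (not physiologically calibrated) parameter values.
sol = solve_ivp(coupled_vdp, (0, 200), [0.1, 0.0, 0.1, 0.0],
                args=(0.5, 0.1, 0.2), max_step=0.05)
xl, vl = sol.y[0], sol.y[1]  # (xl, vl) traces the phase-space orbit
# A glottal-flow-like signal can then be read off from the glottal
# opening, e.g. proportional to max(0, x0 + xl + xr) for a rest gap x0.
```

The self-sustained limit-cycle orbit in the (xl, vl) phase plane is the kind of characteristic pattern referred to below; changing the asymmetry parameter delta visibly distorts the orbit.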
For a given recording, the parameters of these models are derivable through the ADLES algorithm, which minimizes the squared error between a glottal flow signal estimated through inverse filtering and the glottal flow signal generated through the model. The discriminability of such features is evident from the highly characteristic patterns exhibited by the corresponding model in its phase space. For example, while we know that there are changes in voice in response to various diseases of the vocal folds, and may have observed changes in voice as a result of COVID-19, it is hard to characterize, identify or measure the exact changes in spectrographic and other signal representations. However, the vocal fold movements (the phonation process) are directly affected by these diseases, and the oscillations emulated by a fitted model are highly correlated with the actual vocal fold oscillations as an affected person speaks. This is borne out by the phase space patterns exhibited by a 1-mass model, as shown in Fig. 1. We can therefore use the corresponding parameter values (and even other measurements that pertain to the phase space trajectories shown) to build predictors for the underlying conditions. The model parameters are then the proxy features we are looking for.
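The fitting idea behind such parameter estimation can be sketched, in highly simplified form, as an ordinary least-squares problem. In the toy sketch below, a parametric flow shape stands in for the ODE-generated glottal flow, and the inverse-filtered flow estimate is simulated; the actual ADLES algorithm instead integrates the fold dynamics and minimizes the error via adjoint equations:

```python
import numpy as np
from scipy.optimize import least_squares

def model_flow(params, t):
    """Toy stand-in for model-generated glottal flow: a rectified
    sinusoid with amplitude, fundamental frequency, and a shape
    parameter controlling pulse skew/width."""
    amp, f0, shape = params
    return amp * np.maximum(0.0, np.sin(2 * np.pi * f0 * t)) ** shape

def fit_model(flow_estimate, t, init):
    """Least-squares fit of model parameters so that the model's flow
    matches an inverse-filtered glottal flow estimate. The fitted
    parameters are the proxy features."""
    resid = lambda p: model_flow(p, t) - flow_estimate
    return least_squares(resid, init).x

t = np.linspace(0.0, 0.05, 500)
target = model_flow((0.8, 120.0, 2.0), t)   # simulated inverse-filtered flow
params = fit_model(target, t, init=(0.9, 119.5, 1.8))
```

After fitting, `params` (here amplitude, f0 and pulse shape; in the real setup, the physical model's mass, stiffness and coupling parameters) can be fed directly to a downstream classifier.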



2.1.2 Proxy features from neural networks: measurement through correlation
Proxy features can also be derived using neural networks and other classifiers that learn to perform classification tasks on the signals within which we desire to identify or measure biomarkers. When trained to perform auxiliary tasks that are equivalent to those that must be actually performed using objective measures of the biomarker, the scores generated through the auxiliary classifiers act as proxy features for the biomarkers in question. For example, it is easy to identify, but difficult to measure changes in the “nasality” of speech signals. However, it is relatively easy to identify and isolate nasal and non-nasal phonemes in speech, and build classifiers to discriminate between them. When properly trained, an accurate classifier would generate scores that are discriminative of nasality characteristics. If this were not so, the classifier would not be able to accurately perform the task of discriminating between nasal and non-nasal phonemes. Once trained, the classifier can be used to generate scores for training newer predictors of underlying conditions, based on the nasality characteristics that we wish to measure.
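A minimal sketch of this idea follows, using synthetic stand-ins for the labeled frame-level features; a real system would use, for instance, MFCC frames aligned to nasal vs. non-nasal phoneme labels from a transcribed corpus:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for frame-level acoustic features (e.g. MFCCs)
# from frames labeled nasal (/m/, /n/) vs. non-nasal.
rng = np.random.default_rng(0)
X_nasal = rng.normal(loc=1.0, scale=1.0, size=(200, 13))
X_other = rng.normal(loc=-1.0, scale=1.0, size=(200, 13))
X = np.vstack([X_nasal, X_other])
y = np.r_[np.ones(200), np.zeros(200)]

# Auxiliary classifier: discriminate nasal from non-nasal frames.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Its per-frame scores on new speech act as a proxy "nasality"
# feature, usable to train a downstream predictor of the condition.
nasality_score = clf.predict_proba(rng.normal(size=(5, 13)))[:, 1]
```

The proxy feature is the classifier's score, not its hard decision: the continuous score preserves gradations of nasality that the binary label would discard.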
3 AI systems for biomarker discovery
With advancements in voice profiling techniques, it has become evident that for a large set of diseases for which correlations to voice were not known to exist, the presence of such correlations can in fact be hypothesized and scientifically supported. Since for these diseases we know that biomarkers are not perceptible or evident within standard analytical representations of the voice signal, we must devise techniques to discover or create them in appropriate mathematical spaces within which they can be shaped and measured.
One example of such a biomarker discovery system/framework is illustrated in Fig. 2. This represents a generic setup which we call the ABCDE framework (Autoencoder based Biomarker Creation and Discovery Engine). Its exact formulation can vary from application to application. In this framework, a speech signal is first converted into a numerical representation that is hypothesized to contain the biomarker related information that we seek to uncover, or extract. The representation is then “projected” into a neural kernel space, where we can impose objective criteria on it that are tailored to the specific biomarker.

Fig. 2: A system for feature discovery
Viewing Fig. 2 from the left, from a given voice signal S, several types of signal representations are obtained using digital signal processing (DSP) algorithms. Thereafter, one of these (SPi) is chosen as the substrate for the target feature to be created (carrying the relevant biomarker). The other representations (SP1, SP2,…) serve as “control” representations, and are placed within the DSP stack shown in Fig. 2.
The cuboid for SPi represents a stack of the same “type” of DSP features, e.g. a stack of spectrograms obtained at different time-frequency resolutions, or a correlogram. These are input into a neural network E (e.g. a convolutional neural network), which can be viewed as an “encoder” that transforms them into a kernel space, yielding a latent feature representation Z. Within this space, different constraints (including those based on prior knowledge) can be imposed on Z, so that it has the desired properties of the biomarker for the targeted health condition. One obvious property it must possess is that it must be discriminative for the underlying medical condition it is expected to encode.
To ensure that the information present in the input representation is preserved in the process of transformation into the kernel space, a decoder D is trained to reconstruct the original representation from Z, while minimizing the loss between the reconstructed and original representations. All loss functions are represented by red double-lines in Fig. 2. In the training process, which is that of parameter optimization of the aggregate neural framework through gradient descent, such a loss function would be minimized.
To ensure that only the information in the original input representation is engineered into the feature at hand, the same latent representation Z is input into a generator G that can recreate the original speech signal from it (as G(Z)). Alternatively, G can act on the output of the decoder, SPo (as G(SPo)), to generate the voice signal. Time-domain signals with the same information content can differ acoustically; thus, more robust losses are defined based on comparisons of DSP features from the original signal with those derived from the voice signal generated by G. Alternatively, the DSP features can be “simulated” by a neural stack, as shown in the figure. An additional level of detail, which allows voice quality features to play a role in this process of discovery, is introduced by specifying rough functional relationships between DSP features and various voice qualities V. Losses based on direct comparisons between these can also play a part in the learning process.
The entire framework is simultaneously optimized, but it is conceivable to express this process of discovery within more sophisticated AI learning frameworks that prioritize interpretability and can be trained in parts. Here, Z represents the “discovered” feature carrying the biomarker whose existence was hypothesized.
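In highly simplified form, the core of such a framework (an encoder E, a decoder D with a reconstruction loss, and a discriminative constraint on Z) can be sketched as follows; the generator G, the control representations and the voice-quality losses of Fig. 2 are omitted, and all dimensions are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

class ABCDESketch(nn.Module):
    """Minimal sketch of the ABCDE idea: an encoder E maps a DSP
    representation to a latent Z, a decoder D reconstructs the input
    (information preservation), and a classification head makes Z
    discriminative for the target condition."""
    def __init__(self, in_dim=257, z_dim=32, n_classes=2):
        super().__init__()
        self.E = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                               nn.Linear(128, z_dim))
        self.D = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(),
                               nn.Linear(128, in_dim))
        self.head = nn.Linear(z_dim, n_classes)

    def forward(self, sp):
        z = self.E(sp)                   # latent biomarker feature Z
        return z, self.D(z), self.head(z)

model = ABCDESketch()
sp = torch.randn(8, 257)                 # a batch of spectrogram frames
labels = torch.randint(0, 2, (8,))       # condition present / absent
z, sp_out, logits = model(sp)
loss = nn.functional.mse_loss(sp_out, sp) \
     + nn.functional.cross_entropy(logits, labels)
loss.backward()                          # joint optimization of E, D, head
```

After training, Z is the discovered feature: discriminative by construction (classification term) while provably retaining the input's information content (reconstruction term).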
4 Conclusions
The biomarker discovery mechanisms given above are generic ones, and may have varied formulations in different settings. In contrast to traditional features derived from audio signals for use with machine learning algorithms, these features are likely to perform better with less training data, since they are designed to be relatively more discriminating and less ambiguous for the specific disease they target. Such features are useful in many ways. For example, they can be used in applications that serve as diagnostic aids in clinical settings, to build tools for early detection of certain diseases, for self-monitoring of health by disabled, elderly and under-resourced people, etc.
5 Acknowledgement
This material is based upon work supported by the Defence Science and Technology Agency, Singapore under contract number A025959. Its content does not reflect the position or policy of DSTA and no official endorsement should be inferred.
References
- [1] Georg G von Polier, Eike Ahlers, Julia Amunts, Joerg Langner, Kaustubh R Patil, Simon B Eickhoff, Florian Helmhold, and Daina Langner, “Predicting adult attention deficit hyperactivity disorder (adhd) using vocal acoustic features,” medRxiv, 2021.
- [2] Arnold E Aronson, William S Winholtz, Lorraine Olson Ramig, and Sandra R Silber, “Rapid voice tremor, or “flutter,” in amyotrophic lateral sclerosis,” Annals of Otology, Rhinology & Laryngology, vol. 101, no. 6, pp. 511–518, 1992.
- [3] Anton Chen and C Gaelyn Garrett, “Otolaryngologic presentations of amyotrophic lateral sclerosis,” Otolaryngology—Head and Neck Surgery, vol. 132, no. 3, pp. 500–504, 2005.
- [4] Chester Griffiths and I David Bough Jr, “Neurologic diseases and their effect on voice,” Journal of Voice, vol. 3, no. 2, pp. 148–156, 1989.
- [5] Jan Hlavnička, Tereza Tykalová, Olga Ulmanová, Petr Dušek, Dana Horáková, Evžen Růžička, Jiří Klempíř, and Jan Rusz, “Characterizing vocal tremor in progressive neurological diseases via automated acoustic analyses,” Clinical Neurophysiology, vol. 131, no. 5, pp. 1155–1165, 2020.
- [6] Juan José G Meilán, Francisco Martínez-Sánchez, Juan Carro, Dolores E López, Lymarie Millian-Morell, and José M Arana, “Speech in alzheimer’s disease: can temporal and acoustic parameters discriminate dementia?,” Dementia and Geriatric Cognitive Disorders, vol. 37, no. 5-6, pp. 327–334, 2014.
- [7] Israel Martínez-Nicolás, Thide E Llorente, Francisco Martínez-Sánchez, and Juan José G Meilán, “Ten years of research on automatic voice and speech analysis of people with alzheimer’s disease and mild cognitive impairment: A systematic review article,” Frontiers in Psychology, vol. 12, pp. 645, 2021.
- [8] Peak Woo, Janina Casper, Raymond Colton, and David Brewer, “Dysphonia in the aging: physiology versus disease,” The Laryngoscope, vol. 102, no. 2, pp. 139–144, 1992.
- [9] Levent Gurbuzler, Ahmet Inanir, Kursat Yelken, Sema Koc, Ahmet Eyibilen, and Ismail Onder Uysal, “Voice disorder in patients with fibromyalgia,” Auris Nasus Larynx, vol. 40, no. 6, pp. 554–557, 2013.
- [10] Kay Coombes, “Voice in people with cerebral palsy,” in Voice Disorders and their Management, pp. 202–237. Springer, 1991.
- [11] Nick Miller, Lindsay Pennington, Sheila Robson, Ella Roelant, Nick Steen, and Eftychia Lombardo, “Changes in voice quality after speech-language therapy intervention in older children with cerebral palsy,” Folia Phoniatrica et Logopaedica, vol. 65, no. 4, pp. 200–207, 2013.
- [12] T Narrain Sawmy Naidoo, “Notes of cases of cholera treated by sulphurous acid,” The Indian medical gazette, vol. 12, no. 8, pp. 219, 1877.
- [13] CC Carpenter, “Cholera: diagnosis and treatment.,” Bulletin of the New York Academy of Medicine, vol. 47, no. 10, pp. 1192, 1971.
- [14] Frederick Humphreys, The Cholera and Its Homoeopathic Treatment, Radde, 1849.
- [15] Maral Asiaee, Amir Vahedian-Azimi, Seyed Shahab Atashi, Abdalsamad Keramatfar, and Mandana Nourbakhsh, “Voice quality evaluation in patients with covid-19: An acoustic analysis,” Journal of Voice, 2020.
- [16] Mahmoud Al Ismail, Soham Deshmukh, and Rita Singh, “Detection of covid-19 through the analysis of vocal fold oscillations,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 1035–1039.
- [17] Soham Deshmukh, Mahmoud Al Ismail, and Rita Singh, “Interpreting glottal flow dynamics for detecting covid-19 from voice,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 1055–1059.
- [18] Abdul-latif Hamdan, Jad Jabbour, Jihad Nassar, Iyad Dahouk, and Sami T Azar, “Vocal characteristics in patients with type 2 diabetes mellitus,” European Archives of Oto-Rhino-Laryngology, vol. 269, no. 5, pp. 1489–1495, 2012.
- [19] Mary T Lee, Jude Thorpe, and Jo Verhoeven, “Intonation and phonation in young adults with down syndrome,” Journal of Voice, vol. 23, no. 1, pp. 82–87, 2009.
- [20] Weifeng Li, Ziyi Chen, Nan Yan, Jeffery A Jones, Zhiqiang Guo, Xiyan Huang, Shaozhen Chen, Peng Liu, and Hanjun Liu, “Temporal lobe epilepsy alters auditory-motor integration for voice control,” Scientific reports, vol. 6, no. 1, pp. 1–13, 2016.
- [21] HT Anil, N Lasya Raj, and Nikitha Pillai, “A study on etiopathogenesis of vocal cord paresis and palsy in a tertiary centre,” Indian Journal of Otolaryngology and Head & Neck Surgery, vol. 71, no. 3, pp. 383–389, 2019.
- [22] Nazila Jamshidi and Andrew Dawson, “The hot patient: acute drug-induced hyperthermia,” Australian prescriber, vol. 42, no. 1, pp. 24, 2019.
- [23] Megan E Musselman and Suprat Saely, “Diagnosis and treatment of drug-induced hyperthermia,” American Journal of Health-System Pharmacy, vol. 70, no. 1, pp. 34–42, 2013.
- [24] Ahmed Faraz Aslam, Ahmad Kamal Aslam, Balendu C Vasavada, and Ijaz A Khan, “Hypothermia: evaluation, electrocardiographic manifestations, and management,” The American journal of medicine, vol. 119, no. 4, pp. 297–301, 2006.
- [25] Clare F Lee, Paul N Carding, and Mike Fletcher, “The nature and severity of voice disorders in lung cancer patients,” Logopedics Phoniatrics Vocology, vol. 33, no. 2, pp. 93–103, 2008.
- [26] WH Lloyd, “Value of the voice in diagnosis of myxoedema in the elderly,” British medical journal, vol. 1, no. 5131, pp. 1208, 1959.
- [27] Paul E Marik and Danielle Kaplan, “Aspiration pneumonia and dysphagia in the elderly,” Chest, vol. 124, no. 1, pp. 328–336, 2003.
- [28] KD Shah, KH Ayyer, and UK Shah, “Hoarseness of voice—a presenting manifestation of primary pulmonary hypertension,” Indian Journal of Otolaryngology, vol. 32, no. 2, pp. 35–36, 1980.
- [29] Keith W Godin and John HL Hansen, “Physical task stress and speaker variability in voice quality,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, pp. 1–13, 2015.
- [30] Klaus R Scherer, “Voice, stress, and emotion,” in Dynamics of stress, pp. 157–179. Springer, 1986.
- [31] Isabel Guimarães and Evelyn Abberton, “Health and voice quality in smokers: an exploratory investigation,” Logopedics Phoniatrics Vocology, vol. 30, no. 3-4, pp. 185–191, 2005.
- [32] Shin-Woong Cho, Chang Shik Yin, Young-Bae Park, and Young-Jae Park, “Differences in self-rated, perceived, and acoustic voice qualities between high-and low-fatigue groups,” Journal of Voice, vol. 25, no. 5, pp. 544–552, 2011.
- [33] Ruth Epstein, Shashivadan P Hirani, Jan Stygall, and Stanton P Newman, “How do individuals cope with voice disorders? introducing the voice disability coping questionnaire,” Journal of Voice, vol. 23, no. 2, pp. 209–217, 2009.
- [34] Gail B Kempster, Bruce R Gerratt, Katherine Verdolini Abbott, Julie Barkmeier-Kraemer, and Robert E Hillman, “Consensus auditory-perceptual evaluation of voice: development of a standardized clinical protocol,” American Journal of Speech-Language Pathology (AJSLP), vol. 18, no. 2, pp. 124–132, 2009.
- [35] Jorge C Lucero and Jean Schoentgen, “Modeling vocal fold asymmetries with coupled van der pol oscillators,” in Proceedings of Meetings on Acoustics ICA2013. Acoustical Society of America, 2013, vol. 19, p. 060165.
- [36] Jorge C Lucero, “Dynamics of the two-mass model of the vocal folds: Equilibria, bifurcations, and oscillation region,” The Journal of the Acoustical Society of America, vol. 94, no. 6, pp. 3104–3111, 1993.