
CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using Cochlear Cepstrum-Based Masking for Speech Emotion Recognition

Abstract

Self-supervised learning (SSL) for automated recognition of the emotional content of speech can be heavily degraded by the presence of noise, which hinders the modeling of the intricate temporal and spectral informative structures of speech. Recently, SSL on large speech datasets, as well as new audio-specific SSL proxy tasks such as temporal and frequency masking, have emerged, yielding superior performance compared to classic approaches drawn from the image augmentation domain. Our proposed contribution builds upon this successful paradigm by introducing CochCeps-Augment, a novel bio-inspired masking augmentation task for self-supervised contrastive learning of speech representations. Specifically, we utilize the newly introduced bio-inspired cochlear cepstrogram (CCGRAM) to derive noise-robust representations of input speech, which are then further refined through a self-supervised learning scheme. The latter employs SimCLR to generate contrastive views of a CCGRAM through masking of its angle and quefrency dimensions. Our experimental approach and validations on the K-EmoCon emotion recognition benchmark dataset, for the first time via a speaker-independent approach, feature unsupervised pre-training, linear probing and fine-tuning. Our results show the potential of CochCeps-Augment to serve as a standard tool in speech emotion recognition analysis, demonstrating the added value of incorporating bio-inspired masking as an informative augmentation task for self-supervision. Our code for implementing CochCeps-Augment will be made available at: https://github.com/GiannisZgs/CochCepsAugment.

Index Terms:
CochCeps-Augment, Self-Supervised Learning, Contrastive Learning, SimCLR, Cochlear Cepstrum, Cepstral Augmentation, Bio-inspired SSL, Speech Emotion Recognition.

I Introduction

Retrieving information implicitly from spoken language and natural sounds is believed to constitute one of the core learning mechanisms during human development: humans essentially form a perception of their acoustic surroundings based on salient acoustic representations [1]. Similarly, advances in neural networks have formed the yet nascent, though promising, field of Self-Supervised representation Learning (SSL), in an effort to mimic human acquisition of knowledge that does not rely on explicitly taught or labeled paradigms [1]. Contrastive learning, a sub-category of SSL, leverages simple augmentations to generate multiple views of a sample and then encourages similarity between their feature representations, thereby fostering invariance to the augmented factors [1].

In the area of speech and audio processing, SSL provides a compelling paradigm that can mitigate the lack of sufficiently large and well-annotated datasets [2], hence facilitating the learning of intelligible speech features that enhance the performance on a multitude of downstream tasks. SSL concepts from computer vision and natural language processing, such as masking [3, 4], have been extended to audio processing and have successfully exploited vast unlabeled speech datasets to create large, pre-trained-on-speech models such as wav2vec 2.0 [5] and HuBERT [6]. However, SSL for audio has limitations that relate to the temporal structure of time series, as well as the heavy noise contamination usually present in audio [7], factors that limit the application of classic contrastive image augmentations to audio or audio-derived representations such as spectrograms and MFCCs. To that end, audio-tailored augmentation methods based on masking, such as SpecAugment [8] and MaskSpec [2], have been designed and successfully applied for contrastive learning on audio data [9].

Among the recently evolving tasks is speech emotion recognition (SER), which currently plays a central role in Human-Computer Interaction (HCI) [10], as well as in various healthcare [11] and e-learning paradigms in education applications [12]. SER constitutes a sub-field of automatic speech recognition that benefits from feature representations of the raw audio, such as spectrograms or MFCCs [13, 14]. SSL masking-based approaches in particular have proven very effective in the SER paradigm, yet models pre-trained on large generic speech datasets are preferred, as their representations have been shown to be beneficial for the task of SER [15].

Fig. 1: Our proposed bio-inspired CochCeps-Augment SSL framework: (a) Cochlear Cepstrum, (b) SimCLR Contrastive pre-training.

We believe that bio-inspired signal representation methods could guide contrastive learning methods to learn noise-robust, intelligible features, thereby enhancing the generalization ability and efficacy of the model for the problem at hand (i.e., SER). To the best of our knowledge, this is the first work that attempts to enhance synergies between machine intelligence and human perception by adopting a bio-inspired cochlear cepstral representation of speech signals for a novel SSL framework in SER. As opposed to the well-known Mel Frequency Cepstral Coefficients (MFCCs) [16] and GammaTone filterbank Cepstral Coefficients (GTCCs) [17], we adopt the Cochlear Filterbank Cepstral Coefficients (CFCCs), also referred to as the cochlear cepstrogram (CCGRAM), which mimic both the function and the structure of the human cochlea [18]. The human cochlea is characterized by a spiral structure that encodes the frequency-position map, which is the essence of the superb frequency resolution of the human ear. In particular, the cochlear spiral is geometrically composed of slightly more than two and a half turns, spanning θ = 0° at the base of the cochlea (high-frequency hearing, up to 20 kHz) to θ = 990° at its apex (low-frequency hearing, down to 10 Hz) [19].

In light of the aforementioned, the contribution of our work is a novel augmentation method which we call CochCeps-Augment. Our method draws inspiration from SpecAugment [8]; in contrast to SpecAugment, however, it operates on the image representation of the CCGRAM of the input audio by applying masking along the angle and quefrency axes, therefore encouraging self-supervision to attend to meaningful, tonotopically-organized audio properties that are intelligible to humans. CochCeps-Augment is simple and cost-effective to apply during self-supervised pre-training and, due to the bio-inspired nature of the CFCCs [18], exhibits enhanced noise robustness, which is highly desirable in audio analysis. The results of applying CochCeps-Augment on the K-EmoCon dataset [20], for the first time in a speaker-independent manner, via a self-supervised contrastive learning scheme, showcase the potential of our bio-inspired masking augmentation task. We believe that CochCeps-Augment will not only enhance SER tasks, but will also likely unlock new avenues for various applications in speech and acoustic signal processing, especially by blending human auditory mechanisms into SSL. An overview of our method is presented in Figure 1.

II Cochlear Cepstrum Background

The human ear shows remarkable ability in recognizing speech content under variable conditions of background noise, thereby inspiring a range of signal processing and representation methods [21, 22]. We employed a recently proposed bio-inspired, noise-robust feature space, called Cochlear Filterbank Cepstral Coefficients (CFCCs) [18]. The CFCCs replicate both the structure and the function of the human spiral cochlea and are based on the concept of the cochlear transform (CT) [22], a general signal processing framework that mimics the active and non-linear multiscale analysis that the cochlea performs on acoustic signals. The resulting cochlear modes are transformed to the cochlear cepstral space by means of a logarithmic transformation and a Discrete Cosine Transform (DCT) of the modes' energy. Hence, for the computation of the CFCCs of a signal p(t), we start by computing the orthogonal cochlear modes (in the frequency domain), denoted FCT, that result from the CT, as follows:

FCT_{p}(\theta,\omega)=\sqrt{\theta}\,P(\omega)\,\Phi^{*}(\theta,\omega),   (1)

where θ is the angle along the spiral cochlear space, namely θ ∈ [0°, 990°], P(ω) is the spectrum of the input signal p(t), and Φ*(θ,ω) denotes the complex conjugate of the cochlear mode basis function. It should be noted that the tonotopic place-pitch map is defined as follows:

f(\theta)=165.4\left(3251^{2.1}\,(\theta+177.3)^{-2.1\times 1.149}-0.88\right),   (2)

where f is the frequency and θ is the angular position along the spiral cochlea. It should be noted that θ = 0° corresponds to the base of the cochlea (high-frequency region) and θ = 990° corresponds to its apex (low-frequency region). Afterwards, we compute the log magnitude spectrum and the DCT to extract the tonotopically (spatially) organized cochlear cepstral coefficients. By definition, the real CFCCs at a specific angular position θ are:

CFCC(m,\theta)=\sqrt{\frac{2}{K}}\sum_{k=1}^{K}\log(X_{k})\cos\left[\frac{\pi k}{K}\left(m-\frac{1}{2}\right)\right],   (3)

where CFCC(m,θ) is the m-th cochlear cepstral coefficient at angle θ, m is the cochlear quefrency index with 1 ≤ m ≤ M, K is the number of cochlear modes and X_k is the energy of the k-th cochlear mode (i.e., of FCT_p(θ)).
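To make the computation of Eqs. (2) and (3) concrete, a minimal NumPy sketch is given below; the mode-energy vector X_k is filled with placeholder values (computing the actual cochlear modes of Eq. (1) is outside its scope), and all function names are illustrative.

```python
import numpy as np

def place_pitch_map(theta_deg):
    """Tonotopic place-pitch map of Eq. (2): angle along the spiral (degrees) -> frequency (Hz)."""
    return 165.4 * (3251 ** 2.1 * (theta_deg + 177.3) ** (-2.1 * 1.149) - 0.88)

def cfcc_at_angle(mode_energies, num_coeffs=20):
    """Eq. (3): DCT-style projection of the log mode energies X_k at a fixed angle theta.

    mode_energies: array of K positive cochlear-mode energies X_k (assumed to be
    precomputed from the cochlear transform of Eq. (1)).
    Returns the first `num_coeffs` coefficients CFCC(m, theta), m = 1..M.
    """
    K = len(mode_energies)
    k = np.arange(1, K + 1)                    # mode index k = 1..K
    m = np.arange(1, num_coeffs + 1)[:, None]  # quefrency index m = 1..M
    basis = np.cos(np.pi * k[None, :] / K * (m - 0.5))
    return np.sqrt(2.0 / K) * (basis * np.log(mode_energies)[None, :]).sum(axis=1)

# Toy usage: cochlear modes spaced every 45 degrees, placeholder positive energies.
angles = np.arange(0, 991, 45)
X = np.random.default_rng(0).uniform(0.1, 1.0, size=len(angles))
print(place_pitch_map(np.array([0.0, 495.0, 990.0])))  # frequencies at base, middle, apex
print(cfcc_at_angle(X)[:5])                            # first few cochlear cepstral coefficients
```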

III Proposed Method

III-A Cepstral Masking Augmentation Policy

To enhance the expressivity of the CCGRAM features in modelling speech in an SSL setting, we design our masking augmentations to operate along both the angle axis θ and the quefrency axis m. In doing so, we encourage self-supervision to infer the masked frequency tones arising from particular angular positions along the cochlea and the masked speech segments along the time domain, both individually and in a combined fashion. Therefore, given a sample with two cepstral-augmented views, the model should be able to predict the missing tones given the presence of other, contextually and acoustically relevant tones. Hence, it becomes evident that the proposed CochCeps-Augment proxy task is relevant both for generic speech and audio processing and in the context of SER. This leads us to the following three augmentation transforms:

  1. Angle masking, applied along the angle axis θ, so that Φ distinct masks are applied. Each mask spans ϕ consecutive angle bands, i.e. [θ_0, θ_0+ϕ), where ϕ is sampled randomly from a uniform distribution from 0 to Φ, and θ_0 is a random angle sampled from [0°, 990°−ϕ).

  2. Quefrency masking, applied along the quefrency indexes m, so that Q distinct masks are applied. Each mask spans q consecutive quefrency indexes, i.e. [m_0, m_0+q), where q is sampled randomly from a uniform distribution from 0 to Q, and m_0 is randomly chosen from [0, M−q).

  3. Cepstral masking, defined as the simultaneous masking of both angle bands and quefrency indexes according to the above parameters.

Figure 2 shows an example of the application of our proposed augmentations to an input CCGRAM; a minimal code sketch of the three transforms is given after the figure.

Fig. 2: Family of the proposed CCGRAM augmentations applied on a single sample (for θ ∈ [0°, 990°] and quefrency index m).
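A minimal sketch of the three transforms follows, assuming the CCGRAM is stored as a 2-D array with angle bands along the rows and quefrency indexes along the columns; the split of Φ and Q into a mask count and a maximum mask width, the zero mask value and the function names are our own illustrative choices rather than details fixed by the text.

```python
import numpy as np

def angle_mask(ccgram, num_masks=2, max_width=2, rng=None):
    """Zero out bands of consecutive angle rows; each width is drawn uniformly from [0, max_width]."""
    rng = rng if rng is not None else np.random.default_rng()
    out, n_angle = ccgram.copy(), ccgram.shape[0]
    for _ in range(num_masks):
        width = int(rng.integers(0, max_width + 1))
        if width:
            start = int(rng.integers(0, n_angle - width))
            out[start:start + width, :] = 0.0
    return out

def quefrency_mask(ccgram, num_masks=5, max_width=5, rng=None):
    """Zero out spans of consecutive quefrency columns, analogously to angle_mask."""
    rng = rng if rng is not None else np.random.default_rng()
    out, n_quef = ccgram.copy(), ccgram.shape[1]
    for _ in range(num_masks):
        width = int(rng.integers(0, max_width + 1))
        if width:
            start = int(rng.integers(0, n_quef - width))
            out[:, start:start + width] = 0.0
    return out

def cepstral_mask(ccgram, rng=None):
    """Mask angle bands and quefrency indexes simultaneously."""
    return quefrency_mask(angle_mask(ccgram, rng=rng), rng=rng)

# Two independent augmented views of one (here random) 20x239 CCGRAM.
ccgram = np.random.default_rng(0).standard_normal((20, 239))
view_i, view_j = cepstral_mask(ccgram), angle_mask(ccgram)
```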

III-B SimCLR - Self-Supervised Contrastive Learning of Representations

To take advantage of the masking benefits that CochCeps-Augment adds to the learning process of an SSL system, we adopt SimCLR as our pre-training framework [23]. In SimCLR, two augmentation transforms (t ∼ T and t′ ∼ T) are sampled from the family T of the proposed CochCeps-Augment transforms. SimCLR promotes similarity between two augmented views of an input sample, designed to maintain invariance on information axes that are deemed redundant. Specifically, meaningful representations can be learned by maximizing the similarity between positive pairs (views of the same sample) and minimizing the similarity between negative pairs (views belonging to different samples) in a latent space. To that end, every sample CCGRAM x is masked twice in order to produce two distinct augmented views, x_i and x_j. By forward-propagating the two views through a shared encoder f(·), two representations h_i and h_j are obtained, which are further projected through a projector g(·) to obtain the final representations z_i and z_j.

The self-supervision without labels in SimCLR is facilitated by the NT-Xent (normalized temperature-scaled cross-entropy) loss function, which is given by:

\mathcal{L}_{\text{SimCLR}}(i,j)=-\log\left(\frac{\exp(\text{sim}(\mathbf{z}_{i},\mathbf{z}_{j})/\tau)}{\sum_{k=1}^{2N}\mathbf{I}[k\neq i]\exp(\text{sim}(\mathbf{z}_{i},\mathbf{z}_{k})/\tau)}\right)

where z_i and z_j are the representations of the augmented views i and j, \text{sim}(\mathbf{z}_{i},\mathbf{z}_{j})=\frac{\mathbf{z}_{i}\cdot\mathbf{z}_{j}}{\|\mathbf{z}_{i}\|\,\|\mathbf{z}_{j}\|} is the cosine similarity, τ is a temperature parameter, I[·] is the indicator function and N is the number of samples in a batch.

In the pre-training phase, the encoder f(·) and the projector g(·) are trained end-to-end and evaluated through the NT-Xent loss. In downstream evaluations, the projector is discarded and the encoder f(·) is used as a feature extractor that yields the features h_i and h_j.
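As an illustration, a compact PyTorch sketch of the NT-Xent loss is given below; it assumes the projected views of a batch are concatenated along the batch dimension (rows i and i+N form a positive pair), which is one common implementation convention rather than a detail prescribed above.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.07):
    """NT-Xent loss over 2N projected representations z; rows i and i+N are the two views of sample i."""
    z = F.normalize(z, dim=1)                       # so that dot products equal cosine similarities
    n = z.shape[0] // 2
    sim = (z @ z.t()) / temperature                 # (2N, 2N) scaled similarity matrix
    sim.fill_diagonal_(float("-inf"))               # implements the indicator I[k != i]
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of each positive
    return F.cross_entropy(sim, targets)            # -log softmax at the positive, averaged over 2N

# Usage with the projector outputs z_i = g(f(x_i)) and z_j = g(f(x_j)) for a batch of 64 samples.
z_i, z_j = torch.randn(64, 256), torch.randn(64, 256)
loss = nt_xent_loss(torch.cat([z_i, z_j], dim=0))
```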

IV Experiments

IV-A K-EmoCon Database

The K-EmoCon database contains multi-modal recordings from 32 participants divided into 16 groups [20]. Participants engaged in a dyadic debate in English (approximately 10 minutes), during which multi-modal data acquisition took place. For emotional labeling, we adopt the self-rated arousal/valence (A/V) space quadrant scheme (see Table I), where the provided integer ratings in the range [1, 5] are binned to a specific quadrant based on their combined arousal and valence, yielding four classes: LowA/LowV (LALV), LowA/HighV (LAHV), HighA/LowV (HALV), HighA/HighV (HAHV).

Table I: Quadrant Annotations: Arousal (a) / Valence (v) values are binned to one of the four quadrants of the A/V space.

       Arousal    Valence
LALV   a < 3      v < 3
LAHV   a < 3      v ≥ 3
HALV   a ≥ 3      v < 3
HAHV   a ≥ 3      v ≥ 3
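As a small illustration of the binning rule in Table I, the helper below maps a pair of self-rated scores to its quadrant label; the function name is illustrative.

```python
def av_quadrant(arousal: int, valence: int) -> str:
    """Bin self-rated arousal/valence scores in [1, 5] to one of the four A/V quadrants."""
    a = "HA" if arousal >= 3 else "LA"
    v = "HV" if valence >= 3 else "LV"
    return a + v  # "LALV", "LAHV", "HALV" or "HAHV"

assert av_quadrant(2, 4) == "LAHV"
assert av_quadrant(3, 1) == "HALV"
```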

Data collection in K-EmoCon was modelled as a natural conversational setting, hence various sources of noise are present. Moreover, the problem of source separation needs to be addressed in speech segments where the participants' and the referee's voices overlap, as well as when heavy degradation occurs due to loud, non-conversation-related sounds. To that end, we design a combined pre-processing strategy that is mainly focused on source separation, so as to obtain speaker-independent speech segments. The raw audio is first downsampled from the initial sampling rate of 22.5 kHz to 16 kHz, followed by max scaling. Silent segments are then removed with Short-Time Fourier Transform energy thresholding [24]. We manually annotate the segments identified as containing sound and, based on their content, apply one of two source separation techniques. For stationary noise, we apply the WTST-NST filter [25] to isolate the non-stationary speech. For speaker separation, we utilize a SepFormer model pre-trained on the LibriMix dataset that outputs two speaker signals [26]. Finally, the resulting speech segments are scaled again and segmented into 3-second windows. To ensure uniform input to our deep learning models, we zero-pad segments that are shorter than 3 s and discard segments that are shorter than 1 s.
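A condensed sketch of the resampling, silence-removal and windowing steps of this pipeline is shown below; the frame-energy threshold is an illustrative stand-in for the STFT-based thresholding of [24], and the manual annotation and source-separation steps (WTST-NST, SepFormer) are omitted.

```python
import numpy as np
from scipy.signal import resample_poly

FS_OUT = 16_000
WIN = 3 * FS_OUT       # 3-second analysis windows
MIN_LEN = FS_OUT       # discard segments shorter than 1 s

def preprocess(audio, fs_in=22_500):
    """Resample to 16 kHz, max-scale, drop low-energy frames and cut into 3 s zero-padded windows."""
    g = np.gcd(FS_OUT, fs_in)
    x = resample_poly(audio, FS_OUT // g, fs_in // g)         # 22.5 kHz -> 16 kHz
    x = x / (np.max(np.abs(x)) + 1e-12)                       # max scaling

    # Crude frame-energy silence removal (stand-in for the STFT energy thresholding of [24]).
    frame = int(0.025 * FS_OUT)
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    voiced = frames[rms > 0.1 * rms.mean()].reshape(-1)

    # 3-second windows: zero-pad short tails, discard anything under 1 s.
    segments = []
    for start in range(0, len(voiced), WIN):
        seg = voiced[start:start + WIN]
        if len(seg) < MIN_LEN:
            continue
        seg = np.pad(seg, (0, WIN - len(seg)))
        segments.append(seg / (np.max(np.abs(seg)) + 1e-12))  # per-segment rescaling
    return segments
```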

We compute the CCGRAM of an input 3-second segment according to [18], using Hamming windows of 25 ms duration with 50% overlap. We set the angle spacing between cochlear modes to 45°, resulting in CCGRAMs of size 20×239.

IV-B Self-Supervised Pre-training

Prior to any processing, we split the data into 4 distinct folds: pre-training (26 speakers), validation (2 speakers), fine-tuning (2 speakers) and test (2 speakers). For SSL pre-training we use the pre-training and validation folds, while for evaluation we use the pre-training, validation and test folds in the linear probing, and the fine-tuning, validation and test folds in the fine-tuning. We follow a speaker-independent approach that relies on a 5-fold cross-validation scheme over K-EmoCon. Specifically, we design the folds so that every speaker is assigned to exactly one fold in each configuration, and each fold contains only whole speakers, i.e., no speaker is split across folds.
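A sketch of the speaker-level split described above is given below, assuming each 3-second segment carries a speaker identifier; the random assignment is illustrative and does not reproduce the exact fold design.

```python
import numpy as np

def speaker_folds(segment_speaker_ids, seed=0):
    """Assign whole speakers to pre-training/validation/fine-tuning/test folds (26/2/2/2 speakers),
    then map every segment to the fold of its speaker, so no speaker is split across folds."""
    rng = np.random.default_rng(seed)
    speakers = rng.permutation(np.unique(segment_speaker_ids))
    groups = {"pretrain": speakers[:26], "val": speakers[26:28],
              "finetune": speakers[28:30], "test": speakers[30:32]}
    return {name: np.flatnonzero(np.isin(segment_speaker_ids, spk)) for name, spk in groups.items()}
```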

For our augmentation pipeline, we first z-normalize all images in a fold with the mean value and standard deviation of that fold. Then, we generate two views of each input CCGRAM by randomly selecting, for each view, one of the three masking transforms in the CochCeps-Augment family. The parameters Φ and Q for the angle and quefrency masking are set to 2 and 5, respectively. After masking, we resize the CCGRAM to a size of 239×239 with nearest-neighbor interpolation. We pre-train from scratch a ResNet18 to be used as the encoder f(·) [27], where we slightly modify the input convolutional layer to accept single-channel intensity images. The projector g(·) is implemented as a two-layer linear head with an output dimension of 256. For the NT-Xent loss, we choose a temperature parameter of 0.07. For our PyTorch implementation, we trained on a machine with four NVIDIA RTX 6000 Ada GPUs, with a batch size of 64, for 1000 epochs on each pre-training fold. The LARS optimizer was used [28], with a starting learning rate of 0.1 and a weight decay of 1e-6. A warm-up period of 10 epochs is followed by training with a cosine decay scheduler [29].
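A sketch of the encoder and projector configuration is shown below; the projector's hidden width and non-linearity are not specified in the text and are assumed here, and plain SGD is used as a stand-in for the LARS optimizer, which is not part of core PyTorch.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# ResNet18 encoder f(.) with a single-channel input stem for CCGRAM intensity images.
encoder = resnet18(weights=None)
encoder.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
encoder.fc = nn.Identity()                               # expose the 512-d representation h

# Two-layer projector g(.) with a 256-d output, used only during pre-training.
projector = nn.Sequential(nn.Linear(512, 512), nn.ReLU(inplace=True), nn.Linear(512, 256))

params = list(encoder.parameters()) + list(projector.parameters())
optimizer = torch.optim.SGD(params, lr=0.1, weight_decay=1e-6)          # stand-in for LARS [28]
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

x_view = torch.randn(64, 1, 239, 239)                    # a batch of augmented, resized CCGRAMs
z = projector(encoder(x_view))                           # 64 x 256 projected representations
```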

IV-C Evaluations

We evaluate the performance of our SSL approach through two schemes: linear probing and fine-tuning. It should be noted that in the evaluations we do not apply CochCeps-Augment, but only the z-normalization and resizing operations. Linear probing refers to the evaluation of the frozen encoder f(·) through a simple linear head that is used to perform the downstream task. Linear probing constitutes a simple evaluation method and allows us to assess the intrinsic quality of the features that the encoder learned during the pre-training phase [1]. Moreover, to test the ability of the contrastive self-supervision to recover the CCGRAM-encoded speech information, we conduct a sanity check where the flattened CCGRAM is fed directly to the linear probe. We expect this evaluation to reveal the information gain that our proposed pre-training approach contributes to the end result; from a technical perspective, flattening disrupts the spatio-temporal cochlear cepstral features, so the raw flattened input should serve as a weak baseline. Fine-tuning, on the other hand, utilizes a non-linear head for downstream classification and refers to the classic process of fine-tuning the whole network, including the encoder. We train with linear probing on the same data that was used for the pre-training and evaluate on the left-out test fold. For fine-tuning, we tune on the fine-tuning fold and evaluate on the test fold. In both settings, we train with a batch size of 16 for 50 epochs to avoid overfitting. The Adam optimizer is used [30] with a starting learning rate of 1e-4 for linear probing and 5e-6 for fine-tuning, with a weight decay of 1e-6 and a cosine decay schedule. For performance evaluation, we use the weighted accuracy and weighted F1-score.
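The two evaluation heads can be contrasted as in the sketch below, which reuses the `encoder` from the pre-training sketch; the hidden width of the non-linear head is assumed, while the four outputs correspond to the A/V quadrants.

```python
import torch.nn as nn

NUM_CLASSES = 4  # LALV, LAHV, HALV, HAHV

# Linear probing: freeze the pre-trained encoder and train only a linear head on its 512-d features.
for p in encoder.parameters():
    p.requires_grad = False
linear_probe = nn.Linear(512, NUM_CLASSES)

# Fine-tuning: unfreeze the encoder and train it jointly with a small non-linear head.
for p in encoder.parameters():
    p.requires_grad = True
nonlinear_head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, NUM_CLASSES))
```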

Table II: SSL Evaluation Results on K-EmoCon: Averages over 5-fold Speaker-Independent Cross-Validation

                               Weighted Accuracy   Weighted F1
Linear Probing – Flattening    0.42                0.45
Linear Probing – ResNet18      0.61                0.50
Fine-Tuning – ResNet18         0.69                0.57

V Results and Discussion

The results of our proposed CochCeps-Augment-driven SSL approach for the SER task are shown in Table II. We employed two main evaluation schemes, namely linear probing and fine-tuning. As illustrated in Table II, for the linear probing task, the weighted accuracy/F1-score is 0.42/0.45 for the flattening sanity check and 0.61/0.50 for the pre-trained ResNet18 encoder, respectively. The performance is markedly enhanced upon fine-tuning, yielding an accuracy and F1-score of 0.69 and 0.57, respectively. This indicates that self-supervision through CochCeps-Augment, together with further refinement of the learned features in the downstream speaker-independent SER task, can indeed uncover meaningful, non-redundant representations.

As K-EmoCon is a small-scale emotion recognition corpus, we believe that pre-training on large-scale speech corpora and validating on additional emotional speech corpora would strengthen our conclusions. Moreover, longer training times have proven beneficial for SSL feature extractors [9] and could thus enhance our results. It is still not clearly understood how each distinct masking scheme of CochCeps-Augment affects the learned representations: angle masking, for instance, steers the learning towards perceived tonotopical relationships in an input utterance, whereas quefrency masking promotes learning of contextual links between speech segments. However, simultaneous or excessive masking of angle and quefrency content may irreversibly degrade the CCGRAM and hence the capacity of self-supervision to recover information from the perturbed CCGRAM. Finally, CochCeps-Augment is a task-agnostic SSL proxy task for speech; therefore, we hypothesize that representations learned through CochCeps-Augment could improve speech recognition systems in a variety of tasks, e.g., speaker identification, automatic speech recognition, and audio event classification. Investigation in these directions is already underway.

VI Conclusion

In this work, we have presented our bio-inspired masking augmentation method, namely CochCeps-Augment, for learning self-supervised speech representations through contrastive learning in the context of SER. We demonstrated, for the first time, how CochCeps-Augment can be seamlessly and cost-effectively integrated into a resource-demanding contrastive SSL setting through SimCLR, in the context of SER, with promising results. This novel approach showcases how human perception of sounds, encapsulated in the signal processing framework of the cochlea, can be placed at the epicentre of a speech self-supervision model, which by design tries to implicitly mimic the way humans perceive auditory events.

References

  • [1] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, Avi Schwarzschild, Andrew Gordon Wilson, Jonas Geiping, Quentin Garrido, Pierre Fernandez, Amir Bar, Hamed Pirsiavash, Yann LeCun, and Micah Goldblum, “A Cookbook of Self-Supervised Learning,” 4 2023.
  • [2] Dading Chong, Helin Wang, Peilin Zhou, and Qingcheng Zeng, “Masked spectrogram prediction for self-supervised audio pre-training,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  • [3] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollar, and Ross Girshick, “Masked Autoencoders Are Scalable Vision Learners,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2022-June, pp. 15979–15988, 11 2021.
  • [4] Jacob Devlin, Ming Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, vol. 1, pp. 4171–4186, 10 2018.
  • [5] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
  • [6] Wei Ning Hsu, Benjamin Bolte, Yao Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 29, pp. 3451–3460, 6 2021.
  • [7] Weiran Wang, Qingming Tang, and Karen Livescu, “Unsupervised Pre-Training of Bidirectional Speech Encoders via Masked Reconstruction,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2020-May, pp. 6889–6893, 5 2020.
  • [8] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2019-September, pp. 2613–2617, 4 2019.
  • [9] Pratham N Soni, Siyu Shi, Pranav R Sriram, Andrew Y Ng, and Pranav Rajpurkar, “Contrastive learning of heart and lung sounds for label-efficient diagnosis,” Patterns, vol. 3, no. 1, 2022.
  • [10] Roddy Cowie, Ellen Douglas-Cowie, Nicolas Tsapatsoulis, George Votsis, Stefanos Kollias, Winfried Fellenz, and John G Taylor, “Emotion recognition in human-computer interaction,” IEEE Signal processing magazine, vol. 18, no. 1, pp. 32–80, 2001.
  • [11] Marwan Dhuheir, Abdullatif Albaseer, Emna Baccour, Aiman Erbad, Mohamed Abdallah, and Mounir Hamdi, “Emotion recognition for healthcare surveillance systems using neural networks: A survey,” in 2021 International Wireless Communications and Mobile Computing (IWCMC). IEEE, 2021, pp. 681–687.
  • [12] Wu Li, Yanhui Zhang, and Yingzi Fu, “Speech emotion recognition in e-learning system based on affective computing,” in Third international conference on natural computation (ICNC 2007). IEEE, 2007, vol. 5, pp. 809–813.
  • [13] Heqing Zou, Yuke Si, Chen Chen, Deepu Rajan, and Eng Siong Chng, “Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2022-May, pp. 7367–7371, 3 2022.
  • [14] Yurun He, Nobuaki Minematsu, and Daisuke Saito, “Multiple Acoustic Features Speech Emotion Recognition Using Cross-Attention Transformer,” pp. 1–5, 5 2023.
  • [15] Sofoklis Kakouros, Themos Stafylakis, Ladislav Mošner, and Lukáš Burget, “Speech-Based Emotion Recognition with Self-Supervised Models Using Attentive Channel-Wise Correlations and Label Smoothing,” pp. 1–5, 5 2023.
  • [16] Nilu Singh, RA Khan, and Raj Shree, “Mfcc and prosodic feature extraction techniques: a comparative study,” International Journal of Computer Applications, vol. 54, no. 1, 2012.
  • [17] Xavier Valero and Francesc Alias, “Gammatone cepstral coefficients: Biologically inspired features for non-speech audio classification,” IEEE transactions on multimedia, vol. 14, no. 6, pp. 1684–1689, 2012.
  • [18] Hessa Alfalahi, Ahsan Khandoker, and Leontios Hadjileontiadis, “Spiral shape matters: Novel bio-inspired cochlear cepstrum,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 1–5.
  • [19] Helge Rask-Andersen, Wei Liu, Elsa Erixon, Anders Kinnefors, Kristian Pfaller, Annelies Schrott-Fischer, and Rudolf Glueckert, “Human cochlea: anatomical characteristics and their relevance for cochlear implantation,” The Anatomical Record: Advances in Integrative Anatomy and Evolutionary Biology, vol. 295, no. 11, pp. 1791–1811, 2012.
  • [20] Cheul Young Park, Narae Cha, Soowon Kang, Auk Kim, Ahsan Habib Khandoker, Leontios Hadjileontiadis, Alice Oh, Yong Jeong, and Uichin Lee, “K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations,” Scientific Data 2020 7:1, vol. 7, no. 1, pp. 1–16, 9 2020.
  • [21] Hessa Alfalahi, Ahsan Khandoker, Ghada Alhussein, and Leontios Hadjileontiadis, “Cochlear decomposition: A novel bio-inspired multiscale analysis framework,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  • [22] Hessa Alfalahi, Ahsan Khandoker, Georgios Apostolidis, and Leontios Hadjileonitiadis, “Cochlear transform,” Authorea Preprints, 2023.
  • [23] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, “A Simple Framework for Contrastive Learning of Visual Representations,” 11 2020.
  • [24] Theodoros Giannakopoulos, “A method for silence removal and segmentation of speech signals, implemented in Matlab.”
  • [25] Leontios J. Hadjileontiadis, Theodore A. Rokkas, and Stavros M. Panas, “Enhancement of bowel sounds by wavelet-based filtering,” IEEE Transactions on Biomedical Engineering, vol. 47, no. 7, pp. 876–886, 2000.
  • [26] Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, and Yoshua Bengio, “SpeechBrain: A General-Purpose Speech Toolkit,” 6 2021.
  • [27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep Residual Learning for Image Recognition,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016-December, pp. 770–778, 12 2015.
  • [28] Yang You, Igor Gitman, and Boris Ginsburg, “Large Batch Training of Convolutional Networks,” 8 2017.
  • [29] Ilya Loshchilov and Frank Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts,” 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, 8 2016.
  • [30] Diederik P. Kingma and Jimmy Lei Ba, “Adam: A Method for Stochastic Optimization,” 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, 12 2014.