
A Novel Policy for Pre-trained
Deep Reinforcement Learning for
Speech Emotion Recognition

Thejan Rajapakshe¹,², Rajib Rana², Sara Khalifa³, Björn W. Schuller⁴,⁶, Jiajun Liu³
²University of Southern Queensland, Australia
³Distributed Sensing Systems Group, Data61, CSIRO Australia
⁴GLAM – Group on Language, Audio & Music, Imperial College London, UK
⁶ZD.B Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, Germany
¹[email protected]
Abstract

Reinforcement Learning (RL) is a semi-supervised learning paradigm where an agent learns by interacting with an environment. Combining deep learning with RL provides an efficient method to learn how to interact with the environment, called Deep Reinforcement Learning (deep RL). Deep RL has achieved tremendous success in gaming, such as with AlphaGo, but its potential has rarely been explored for challenging tasks like Speech Emotion Recognition (SER). Applying deep RL to SER could, for example, improve the performance of an automated call centre agent by dynamically learning emotion-aware responses to customer queries. While the policy employed by the RL agent plays a major role in action selection, there is currently no RL policy tailored to SER. In addition, an extended learning period is a general challenge for deep RL, which can slow the learning for SER. Therefore, in this paper, we introduce a novel policy, the "Zeta policy", which is tailored to SER, and apply pre-training in deep RL to achieve a faster learning rate. Pre-training with a cross dataset was also studied to assess the feasibility of pre-training the RL agent with a similar dataset in a scenario where real environmental data are not available. The IEMOCAP and SAVEE datasets were used for the evaluation, with the task being to recognise the four emotions happy, sad, angry, and neutral in the given utterances. The experimental results show that the proposed "Zeta policy" performs better than existing policies. They also confirm that pre-training can reduce the training time and that the approach is robust to a cross-corpus scenario.

Index Terms:
Machine Learning, Deep Reinforcement Learning, Speech Emotion Recognition

1 Introduction

Reinforcement Learning (RL), a semi-supervised machine learning technique, allows an agent to take actions and interact with an environment so as to maximise the total reward. Much of RL's popularity stems from its super-human performance in games, achieved by systems such as AlphaGo [1] and AlphaStar [2].

RL has also been employed for audio-based applications, showing its potential for audio enhancement [3, 4], automatic speech recognition [5], and spoken dialogue systems [6, 7]. The potential of using RL for speech emotion recognition has recently been demonstrated for robot applications, where the robot can detect an unsafe situation earlier given a human utterance [8]. An emotion detection agent was trained to achieve an accuracy-latency trade-off by penalising wrong classifications as well as late predictions through the reward function. Motivated by this recent study, we employ RL for speech emotion recognition, where a potential application of the proposed system could be an intelligent call centre agent learning over time how to communicate with human customers in an emotionally intelligent way. We consider the speech utterance as the "state" and the classified emotion as the "action". A correct classification yields a positive reward and an incorrect one a negative reward.

RL widely uses Q-learning, a simple yet quite powerful algorithm that creates a state-action mapping, namely the Q-table, for the agent. This, however, is intractable for a large or continuous state and/or action space. First, the amount of memory required to save and update the table grows with the number of states. Second, the amount of time required to explore each state to create the required Q-table would be unrealistic. To address these issues, the deep Q-Learning algorithm uses a neural network to approximate the Q-value function. Deep Q-Learning is an important algorithm enabling deep Reinforcement Learning (deep RL) [9].

The standard deep Q-learning algorithm employs a stochastic exploration strategy called ϵ-greedy, which follows a greedy policy according to the current Q-value estimate and chooses a random action with probability ϵ. Since the application of RL in speech emotion recognition (SER) is largely unexplored, there is not enough evidence whether the ϵ-greedy policy is best suited for SER. In this article, we investigate the feasibility of a policy tailored to SER. We propose the Zeta policy (code: https://github.com/jayaneetha/Zeta-Policy) and provide analysis supporting its superior performance compared to ϵ-greedy and some other popular policies.

A major challenge of deep RL is that it often requires a prohibitively large amount of training time and data to reach a reasonable accuracy, making it inapplicable in real-world settings [10]. Leveraging humans to provide demonstrations (known as learning from demonstration (LfD)) in RL has recently gained traction as a possible way of speeding up deep RL [11, 12, 13]. In LfD, actions demonstrated by the human are considered as the ground truth labels for a given input game/image frame. An agent closely simulates the demonstrator’s policy at the start, and later on, learns to surpass the demonstrator [10]. However, LfD holds a distinct challenge, in the sense that it often requires the agent to acquire skills from only a few demonstrations and interactions due to the time and expense of acquiring them [14]. Therefore, LfDs are generally not scalable, especially for high-dimensional problems.

We propose the technique of pre-training the underlying deep neural networks to speed up training in deep RL. It enables the RL agent to learn better features leading to better performance without changing the policy learning strategies [10]. In supervised methods, pre-training helps regularisation and enables faster convergence compared to randomly initialised networks [15]. Various studies (e. g., [16, 17]) have explored pre-training in speech recognition and achieved improved results. However, pre-training in deep RL is hardly explored in the area of speech emotion recognition. In this paper, we present the analysis showing that pre-training can reduce the training time. In our envisioned scenario, the agent might be trained with one corpus but might need to interact with other corpora. To test the performance in those scenarios, we also analyse performance for cross-corpus pre-training.

2 Related Work and Background

Reinforcement learning has not been widely explored for speech emotion recognition. The closest match to our work is EmoRL, where the authors use deep RL to determine the best position at which to split an utterance before sending it to the emotion recognition model [8]. The RL agent decides on an "action": whether to wait for more audio data or to terminate and trigger prediction. Once the terminate action is selected, the agent stops processing the audio stream and starts classifying the emotion. The authors aim to achieve a trade-off between accuracy and latency by penalising wrong classifications as well as delayed predictions through rewards. In contrast, our focus is on developing a new policy tailored to SER and applying pre-training to achieve a faster learning rate.

Another related study used RL to develop a music recommendation system based on the mood of the listener. The users select their current mood in terms of "pleasure" and "energy", after which the application selects a song from its repository. The users can provide feedback on the recommended audio by answering the question "Does this audio match the mood you set?" [18]. Here, the key focus is to learn the mapping from a song to the selected mood, whereas in this article we focus on the automatic determination of the emotion. Yu and Yang introduced an emotion-based target reward function [19], which again did not focus on SER.

A group of studies used RL for audio enhancement. Shen et al. showed that using RL to enhance audio utterances can reduce the character error rate of automatic speech recognition under noise by about 7 % in the testing phase [3]. Fakoor et al. also used RL for speech enhancement, treating the noise suppressor as a black box and only taking feedback from its output as the reward; they achieved a 42 % increase in the signal-to-noise ratio of the output [4].

On the topic of proposing new policies, Lagoudakis and Parr used a modification of approximate policy iteration for pendulum balancing and bicycle riding domains [20]. We propose a policy which is tailored to SER. Many studies in the literature use deep learning to capture emotion and related features from speech signals; however, none of them focus on RL [21, 22, 23].

An important aspect of our study is incorporating pre-training in RL to achieve a faster learning rate for SER. De la Cruz et al. used human demonstrations of playing Atari games to pre-train a deep RL agent and improve its training time [10]. Hester et al., in their introduction of Deep Q-Learning from Demonstrations, presented prior recorded demonstrations to a deep RL system playing 42 games; 41 of them showed improved performance and 11 of them achieved state-of-the-art performance [24]. A clear advance in performance and accuracy is also observed in the results of their case study with the Speech Commands Dataset. None of these studies focuses on using pre-training in RL for SER, which is our focus.

2.1 Reinforcement Learning

An RL architecture mainly consists of two major components, the "Environment" and the "Agent", with three major signals passing between them: the current state, the reward, and the next state. The agent interacts with an unknown environment and observes a state s_t ∈ S at every time step t ∈ [0, T], where S is the state space and T is the terminal time. The agent selects an action a ∈ A, where A is the action space; the environment then transitions to s_{t+1} and the agent receives a scalar reward r_{t+1}, which is feedback on how good the selected action was. The agent learns a policy π that maps a given state s to the most suitable action a. The objective of RL is to learn π*, an optimal policy that maximises the cumulative reward.
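To make the interaction loop above concrete, the following is a minimal sketch in Python; the env, agent, and policy objects and their method names are illustrative assumptions, not part of the paper's implementation.

```python
# A minimal sketch of the agent-environment loop described above. The
# `env` and `policy` objects are illustrative assumptions; `env.reset()`
# returns a state s_t and `env.step(a)` returns (s_{t+1}, r_{t+1}, done).
def run_episode(env, agent, policy):
    state = env.reset()                               # observe s_0
    total_reward, done = 0.0, False
    while not done:                                   # until the terminal time T
        q_values = agent.q_values(state)              # agent's estimate of Q(s_t, .)
        action = policy.select_action(q_values)       # a_t chosen by the policy
        next_state, reward, done = env.step(action)   # environment feedback r_{t+1}
        agent.remember(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
    return total_reward
```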

2.1.1 Q-Learning

Before introducing Q-Learning, we first introduce a number of key terms.

Model-based and model-free methods: A model-based method learns the transition probabilities from the current state to the next state for each action. Although straightforward, model-based algorithms become impractical as the state and action spaces grow. In contrast, model-free algorithms rely on trial and error to update their knowledge and do not require storing all combinations of states and actions.

On-policy and off-policy learning: These can be described in terms of the target policy and the behaviour policy. The target policy is the policy that the agent tries to learn by learning a value function. The behaviour policy is the policy the agent uses for action selection, in other words, to interact with the environment.

In on-policy learning, the target and behaviour policies are the same, whereas they differ in off-policy learning. Off-policy learning enables continuous exploration and can learn an optimal policy, whereas on-policy learning can only learn a sub-optimal policy.

In this paper, we consider the widely used Q-Learning, a model-free, off-policy learning algorithm. Q-Learning maintains a Q-table which contains a Q-value for each state-action combination. This Q-table is updated each time the agent receives a reward from the environment. A straightforward way to store this information is a table such as Table I.

TABLE I: Fundamental Structure of the Q-Table
state | action | Q(state, action)

Q-values from the Q-table are used for the action selection process. A policy takes Q-values as input and outputs the action to be selected by the agent. Some of these policies are the Epsilon-Greedy policy [25], the Boltzmann Q policy, the Max-Boltzmann Q policy [26], and the Boltzmann-Gumbel exploration policy [27].

The main disadvantage of Q-Learning is that the state space must be discrete: the Q-table cannot store Q-values for a continuous state space. Another disadvantage is that the Q-table grows with the state space and at some point becomes unmanageable. This is known as the "curse of dimensionality" [26], and the complexity of evaluating a policy scales as O(n³), where n is the number of states. Deep Q-Learning offers a solution to these challenges.

2.1.2 Deep Q-Learning

A neural network can be used to approximate the Q-values with the state as input. This is more tractable than storing every possible state in a table like Table I.

Q = neural_network.predict(state)   (1)

As problems and states become more complex, the neural network may need to be "deep", meaning a few hidden layers may not suffice to capture all the intricate details, hence the use of Deep Neural Networks (DNNs). Studies have incorporated DNNs into Q-Learning by replacing the Q-table with a DNN, known as deep Q-Learning, and the combination of deep learning with reinforcement learning is now known as Deep Reinforcement Learning.

The deep Q-Learning process uses two neural network models: the inferring model and the target model. These networks have the same architecture but different weights. Every N steps, the weights of the inferring network are copied to the target network. Using both of these networks leads to more stability in the learning process and helps the algorithm learn more effectively.
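As a rough illustration of this target network mechanism in Keras (the interval value follows the N_copy = 10 000 setting reported in Section 3.4.3; the surrounding training loop is assumed):

```python
# Illustrative sketch: periodically syncing the target network with the
# inferring network in Keras. N_COPY is the interval in steps.
N_COPY = 10_000

def maybe_sync_target(step, inferring_model, target_model):
    if step % N_COPY == 0:
        # Both models share the same architecture, so the weight lists align.
        target_model.set_weights(inferring_model.get_weights())
```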

2.1.3 Warm-up period

Since the RL agent learns only from interacting with the environment and the reward gained, it needs a set of experiences before training with experience replay can start. A parameter nb_warm_up defines the number of warm-up steps to be performed before RL training begins. During this period, actions are selected entirely at random for a given state and no Q-values are involved. The state, action, reward, and next state are stored in the memory buffer and later sampled for experience replay.
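A minimal sketch of this warm-up phase, assuming the illustrative environment interface from the earlier sketch and a simple list-based replay memory:

```python
import random

def warm_up(env, memory, nb_warm_up, n_actions):
    """Fill the replay memory with experiences from purely random actions."""
    state = env.reset()
    for _ in range(nb_warm_up):
        action = random.randrange(n_actions)            # no Q-values involved yet
        next_state, reward, done = env.step(action)
        memory.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
```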

2.1.4 Q-Learning Policies

Q-Learning policies are responsible for deciding the best action to be selected based on the Q-values given as their input. Exploration and exploitation are used to improve the experiences collected for experience replay by adding randomness to the action selection. The Epsilon-Greedy policy, the Max-Boltzmann Q policy, and the Linear Annealed wrapper on the Epsilon-Greedy policy are some of the Q-Learning policies used today [28, 29].

The Epsilon-Greedy policy adopts a greedy algorithm to select an action from the Q-values. The ϵ value of this policy determines the exploitation-exploration ratio: 1 − ϵ is the probability of choosing exploitation during action selection. During exploration, a random action is drawn from a uniform random distribution, while during exploitation the action with the maximum Q-value is selected. The Linear Annealed wrapper on the Epsilon-Greedy policy changes the ϵ value of the Epsilon-Greedy policy at each step: given an ϵ range and a number of steps as parameters, the wrapper linearly updates ϵ at each step.
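The following sketch illustrates Epsilon-Greedy action selection and the linear annealing of ϵ as described above; the function names and the annealing direction (from a larger to a smaller ϵ) are assumptions.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Exploit (argmax Q) with probability 1 - epsilon, else explore uniformly."""
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))   # exploration: uniform random action
    return int(np.argmax(q_values))               # exploitation: greedy action

def linear_annealed_epsilon(step, eps_max, eps_min, nb_steps):
    """Linearly anneal epsilon from eps_max to eps_min over nb_steps steps."""
    frac = min(step / nb_steps, 1.0)
    return eps_max + frac * (eps_min - eps_max)
```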

The Max-Boltzmann policy also uses ϵ as a parameter, which is considered when deciding between exploitation and exploration. Exploration in the Max-Boltzmann policy is similar to that of the Epsilon-Greedy policy. During exploitation, instead of selecting the action with the maximum Q-value as the Epsilon-Greedy policy does, the Max-Boltzmann policy randomly selects an action from a distribution shaped by the Q-values. This introduces more randomness while still using the Q-values in the action selection process.
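A sketch of Max-Boltzmann action selection following the description above (uniform random exploration with probability ϵ, Boltzmann sampling over the Q-values otherwise); the softmax form and the temperature parameter tau are assumptions.

```python
import numpy as np

def max_boltzmann(q_values, epsilon, tau=1.0):
    """With probability epsilon explore uniformly; otherwise sample an action
    from a Boltzmann (softmax) distribution over the Q-values."""
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))           # uniform random exploration
    # Softmax with temperature tau; subtract the max for numerical stability.
    z = np.exp((np.asarray(q_values) - np.max(q_values)) / tau)
    probs = z / z.sum()
    return int(np.random.choice(len(q_values), p=probs))  # Q-value-shaped sampling
```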

2.2 Scope of RL in Speech Emotion Recognition

Emotion recognition from speech has gained a fair amount of attention over the past years among machine learning researchers, and many studies have been carried out to improve the performance of SER at both the feature extraction and emotion classification stages [30]. Hidden Markov Models, Support Vector Machines, Gaussian Mixture Models, and Artificial Neural Networks are some of the classifiers used for SER in the literature [31, 32, 33, 34]. With the tremendous success of DNN architectures, numerous studies have used them successfully and achieved good performance, e.g., [35, 36, 37, 38].

With DNNs, supervised and guided unsupervised/self-supervised techniques dominate SER development; however, there is still a gap in the literature for dynamically updating SER systems. Although some studies, e. g., Learn++ [39] and Bagging++ [40], use incremental/online learning, there is a major difference between online learning and RL. Online learning is usually used for a constant stream of data, where each example is discarded once used. RL, in contrast, builds a series of state-action pairs that draw either a positive or a negative reward: a positive reward reinforces the entire string of actions leading up to it, whereas a negative reward penalises those actions.

Figure 1: Flow of the Zeta Policy and the connection with related components in the RL Architecture.

3 Methodology

3.1 Model Process Flow

Figure 1 shows our proposed architecture and process flow. It focuses mainly on the process flow of the "Zeta Policy" while also showing the interconnection between the inferring and target models and the interaction between the agent and the environment through action and reward. Throughout this section, we gradually elaborate on the different components.

The inferring model approximates the Q-values used in action selection for a given state provided by the environment. After the agent receives a reward signal for an action executed in the environment, the attributes (state, reward, action, next state, and terminal signal) are stored in the experience memory (History DB in Figure 1). Every N_update steps, a batch of samples from the experience memory is used to update the parameters of the inferring model. The target model approximates the Q_target values; the use of Q_target is explained in Section 3.4.4. The parameters of the target model are updated every N_copy steps. N_update and N_copy are hyper-parameters.

3.2 Zeta Policy

This study introduces a novel RL policy, the "Zeta Policy", which takes inspiration from the Boltzmann Q policy and the Max-Boltzmann Q policy. The policy uses the current step cs of the RL training cycle (see Figure 1) to decide how action selection is performed. Figure 1 shows how the action selection process of the Zeta policy connects with the other components of the RL architecture. The parameter Zeta_nb_step (ζ) is a hyper-parameter of the policy and determines the action selection route. If cs < ζ, the policy follows an exploitation-exploration process, where exploitation selects the action with the highest Q-value and exploration selects a random action from a discrete uniform distribution to add uniform randomness to the experiences; the parameter ϵ is compared with a random number n drawn from a uniform distribution to choose between exploration and exploitation. If cs > ζ, an action is sampled from a distribution shaped by the Q-values. Experiments were carried out to find the effect of the parameters ζ and ϵ on the performance of RL. A sketch of this action selection logic is given below.
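As the forward reference above indicates, here is a minimal sketch of the Zeta policy's action selection; interpreting "a distribution shaped by the Q-values" as a softmax over the Q-values, and the temperature tau, are assumptions, and the released code should be treated as authoritative.

```python
import numpy as np

def zeta_policy_action(q_values, current_step, zeta_nb_step, epsilon, tau=1.0):
    """Illustrative sketch of Zeta policy action selection as described above."""
    q_values = np.asarray(q_values)
    if current_step < zeta_nb_step:
        # Early phase (cs < zeta): epsilon-greedy style exploration / exploitation.
        if np.random.random() < epsilon:
            return np.random.randint(len(q_values))   # uniform random exploration
        return int(np.argmax(q_values))               # greedy exploitation
    # Later phase (cs >= zeta): sample an action from a Q-value-shaped
    # (softmax) distribution rather than acting greedily.
    z = np.exp((q_values - np.max(q_values)) / tau)
    probs = z / z.sum()
    return int(np.random.choice(len(q_values), p=probs))
```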

In the SER context, a state s is an utterance from the dataset and an action a is the label (emotion) assigned to that state. A reward r is returned by the environment after comparing the ground truth with the action a: if the action (classified emotion) matches the ground truth, i. e., the agent has inferred the correct label for the given state, the reward is 1; otherwise it is −1. The reward is accumulated throughout the episode and the mean reward is calculated; the higher the mean reward, the better the performance of the RL agent. The standard deviation of the reward is also calculated, since this value indicates how robust the RL predictions are.
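A minimal sketch of the SER environment implied by this formulation, with states as utterance features, actions as emotion labels, and a ±1 reward; the class interface and dataset handling are illustrative assumptions.

```python
class SEREnvironment:
    """States are utterances (their features), actions are emotion labels,
    and the reward is +1 for a correct label and -1 otherwise."""
    def __init__(self, features, labels):
        self.features, self.labels = features, labels
        self.idx = 0

    def reset(self):
        self.idx = 0
        return self.features[self.idx]

    def step(self, action):
        reward = 1 if action == self.labels[self.idx] else -1
        self.idx += 1
        done = self.idx >= len(self.features)
        next_state = None if done else self.features[self.idx]
        return next_state, reward, done
```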

3.3 Pre-training for improved performance

Pre-training allows the RL agent's inferring DNN to optimise its parameters and learn the features required for a similar problem. To use a pre-trained DNN in RL, we replace its softmax layer with a Q-value output layer. As the inferring model starts from learnt features, a long warm-up period to collect experiences for replay is no longer necessary. Indeed, extended training time is a key shortcoming of RL [10]; one key contribution of this paper is to use pre-training to reduce the training time by shortening the warm-up period.

Figure 2: Deep Reinforcement Learning Architecture

3.4 Experimental Setup

In this section, we explain the experimental setup, including feature extraction and model configuration. Figure 2 shows the architecture of the environment, agent, deep neural network, and policy used in this study.

3.4.1 Dataset

TABLE II: Distribution of utterances in the two considered datasets by emotion
Emotion IEMOCAP SAVEE
Happy 895 60
Sad 1200 60
Angry 1200 60
Neutral 1200 120

This study uses two popular datasets in SER: IEMOCAP [41] and SAVEE [42]. The IEMOCAP dataset consists of five sessions; each session includes speech segments from two speakers and is labelled with nine emotional categories. However, we use only happiness, sadness, anger, and neutral for consistency with the literature. The dataset was collected from ten speakers (five male and five female). We took a maximum of 1 200 segments from each emotion to balance the distribution of emotions within the dataset.

Note that the SAVEE dataset is relatively small compared to IEMOCAP. It was collected from 4 male speakers and has 8 emotion labels, which we filtered to keep the happiness, sadness, anger, and neutral segments for alignment with IEMOCAP and the literature.

Table II shows the distribution of utterances in the two datasets; 30 % of each dataset is used as a subset for pre-training and the rest for the RL execution.

20 % of the RL execution data subset is used in the testing phase. The RL testing phase executes the whole RL pipeline but does not update the model parameters. RL, being a different machine learning paradigm from supervised learning, does not normally need a testing dataset, as the agent continuously interacts with an environment. However, since this study addresses a classification problem, we included a testing phase by providing data that the RL agent has not seen before.

We remove the silent sections of the audio segments and process only the initial two seconds of the audio. Segments shorter than two seconds are zero-padded. We chose two seconds because a longer segment length would produce more heavily zero-padded segments, and the padding would become an identifiable feature within the dataset during model training.

3.4.2 Feature Extraction

We use Mel-Frequency Cepstral Coefficients (MFCCs) to represent the speech utterances. MFCCs are widely used in speech audio analysis [35, 43]. We extract 40 MFCCs from the Mel-spectrograms with a frame length of 2 048 and a hop length of 512 using Librosa [44], as sketched below.
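The preprocessing (Section 3.4.1) and feature extraction steps can be sketched as follows; the sample rate of 22 050 Hz and the default trimming threshold are assumptions, but they reproduce the 40 × 87 input shape used in Section 3.4.3.

```python
import numpy as np
import librosa

def extract_mfcc(path, sr=22050, duration=2.0, n_mfcc=40):
    """Trim silence, keep/pad to the first two seconds, then compute 40 MFCCs
    with a frame length of 2048 and a hop length of 512."""
    y, sr = librosa.load(path, sr=sr)
    y, _ = librosa.effects.trim(y)                 # remove leading/trailing silence
    n_samples = int(sr * duration)
    if len(y) < n_samples:
        y = np.pad(y, (0, n_samples - len(y)))     # zero-pad short utterances
    else:
        y = y[:n_samples]                          # keep only the first two seconds
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=2048, hop_length=512)
    return mfcc                                    # shape (40, 87) at 22 050 Hz
```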

Figure 3: Comparison across the Zeta policy (Zeta), the Epsilon-Greedy policy (EpsG.), the Max-Boltzmann Q-policy (MaxB.), and the Linear Annealed wrapper on the Epsilon-Greedy policy (LA) for the two datasets IEMOCAP (IEM) and SAVEE (SAV).
Figure 4: Comparison of mean reward and standard deviation of reward across policies when changing the ϵ value of the Zeta policy (Zeta), the Epsilon-Greedy policy (EpsG.), the Max-Boltzmann Q-policy (MaxB.), and the Linear Annealed wrapper on the Epsilon-Greedy policy (LA).
Figure 5: Comparison of mean reward and standard deviation of reward when changing the number of steps of the Zeta policy (Zeta) and the Linear Annealed wrapper on the Epsilon-Greedy policy (LA).

3.4.3 Model Recipe

A supervised learning approach was followed to identify the best-suited DNN architecture for the inferring model. Even though the last layer of a supervised learning model outputs a probability vector, RL learns representations in the hidden layers through a similar mechanism. The DNN architecture of the supervised learning model is identical to that of the inferring model except for the output layer: the activation function of the supervised model's output layer is softmax, whereas in RL it is a linear activation. Different architectures containing Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs), and dense layers were evaluated on a similar dataset, and the architecture with the highest testing accuracy was selected as the inferring DNN architecture.

We use the popular Deep Learning API Keras [45] with TensorFlow [46] as the backend for modelling and training purposes in this study. We model the RL agent’s DNN with a combination of CNNs and LSTM. The use of a CNN-LSTM combined model is motivated by the ability to learn temporal and frequency components in the speech signal [47]. We stack the LSTM on a CNN layer and pass it on to a set of fully connected layers to learn discriminative features [35].

As shown in the Deep Neural Network component in Figure 2, the initial layers of the inferring DNN model are two 2D convolutional layers with filter sizes 5 and 3, respectively. Batch normalisation is applied after the first 2D convolutional layer. The output of the second 2D convolutional layer is passed to an LSTM layer of 16 cells and then to a fully connected layer of 265 units. A dropout layer with rate 0.3 is applied before the dense layer that outputs the Q-values. The number of outputs in this dense layer equals the number of actions, i. e., the number of classes in the classification problem. The activation function of the last layer is kept linear, since the output Q-values should not be normalised. The input shape of the model is 40 × 87, where 40 is the number of MFCCs and 87 is the number of frames in the MFCC spectrum. An Adam optimiser with a learning rate of 0.00025 is used to compile the model. A sketch of this architecture is given below.
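The following is a sketch of this inferring DNN in Keras; the number of convolutional filters, the ReLU activations, the axis handling between the CNN and the LSTM, and the MSE loss are assumptions not stated in the text.

```python
# A minimal sketch of the inferring DNN described above; unstated details are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_inferring_model(n_mfcc=40, n_frames=87, n_actions=4):
    inputs = layers.Input(shape=(n_mfcc, n_frames, 1))               # 40 x 87 MFCC input
    x = layers.Conv2D(32, kernel_size=5, activation='relu')(inputs)  # filter count assumed
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(32, kernel_size=3, activation='relu')(x)       # filter count assumed
    # Move the time axis first and flatten the rest so the LSTM sees a sequence.
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.LSTM(16)(x)                                           # 16 LSTM cells
    x = layers.Dense(265, activation='relu')(x)                      # fully connected layer
    x = layers.Dropout(0.3)(x)
    q_values = layers.Dense(n_actions, activation='linear')(x)       # unnormalised Q-values
    model = models.Model(inputs, q_values)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00025),
                  loss='mse')                                        # loss choice assumed
    return model
```

The same builder would be used for both the inferring and the target model, which share this architecture.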

The output layer of the neural network model returns an array of Q-values, one per action, for a given input state. All the Q-learning policies used in this paper (Zeta, Epsilon-Greedy, Max-Boltzmann Q policy, and the Linear Annealed wrapper on Epsilon-Greedy) then use these Q-values for action selection. We set the number of steps between inferring model updates (N_update) to 4 and the number of steps between parameter copies from the inferring model to the target model (N_copy) to 10 000.

Supervised learning is used to pre-train the inferring model. A DNN with the same architecture as the inferring model, but with a softmax activation function at the output layer, is trained with the pre-training data subset before starting the RL execution. The model is trained for 64 epochs with a batch size of 128. Once pre-training is completed, the parameters of the pre-trained DNN are copied to the inferring and target models of the RL agent, as sketched below.
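A sketch of this pre-training step, reusing build_inferring_model from the previous sketch; attaching the softmax head directly to the Q-network's own feature layers (so they share weights) is one possible implementation and is an assumption, as is the cross-entropy loss.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def pretrain(q_model, x_pretrain, y_pretrain, n_classes=4):
    """Train a softmax classifier on the pre-training subset; the feature
    layers are shared with q_model, so they end up pre-trained while the
    linear Q-value head stays randomly initialised."""
    features = q_model.layers[-2].output                   # output of the dropout layer
    probs = layers.Dense(n_classes, activation='softmax')(features)
    clf = models.Model(q_model.input, probs)               # shares feature layers with q_model
    clf.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00025),
                loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    clf.fit(x_pretrain, y_pretrain, epochs=64, batch_size=128)
    return q_model

# After pre-training, the target model is initialised from the inferring model:
# target_model.set_weights(q_model.get_weights())
```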

3.4.4 Model parameter update

The mathematical formulation of Q-Learning is based on Bellman's equation for the state-value function, Eq. (2), where v(s) is the value of state s, R_t is the reward at time t, and γ is the discount factor.

v(s) = E[R_{t+1} + γ v(s_{t+1}) | S_t = s]   (2)

Equation (2) can be rewritten to obtain Bellman's state-action value function, known as the Q-function:

q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]   (3)

Here, q_π(s, a) is the Q-value of state s when taking action a under the policy π.

We use a DNN to approximate the Q-values, with q_π(s, a) taken as the target Q-value, giving the loss function (4):

L = Σ (Q_target − Q)²   (4)
Q_target = R(s_{t+1}, a_{t+1}) + γ Q(s_{t+1}, a_{t+1})   (5)

Combining Equations (4) and (5), the loss function can be written as:

L = Σ (R(s_{t+1}, a_{t+1}) + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))²   (6)

Minimising the loss L is the optimisation problem solved in the training phase.

Q_target is obtained by inferring the state s through the target network via function (1). The output layer of the DNN model is a dense layer with a linear activation function. Following the feed-forward pass of a deep neural network to the last layer l, the Q-value for an action a is obtained by Equation (7),

Q_a = W^l x^{l−1} + b^l   (7)

where x^{l−1} is the input to layer l. The backpropagation pass during parameter optimisation updates the W^l and b^l values of each layer l to minimise the loss L.
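Putting Equations (4)-(6) together, the parameter update over a replay batch can be sketched as follows; the discount factor value and the greedy choice of the next action a_{t+1} (a max over the target network's Q-values, as in standard deep Q-learning) are assumptions, and the two Keras models follow the earlier sketches.

```python
import numpy as np

GAMMA = 0.99  # discount factor gamma (value assumed; not stated in the text)

def train_on_replay_batch(inferring_model, target_model, batch):
    """One parameter update following equations (4)-(6) over a replay batch.
    `batch` holds numpy arrays; `dones` is 1.0 for terminal transitions."""
    states, actions, rewards, next_states, dones = batch
    q_values = inferring_model.predict(states, verbose=0)      # Q(s_t, .)
    q_next = target_model.predict(next_states, verbose=0)      # Q_target(s_{t+1}, .)
    targets = q_values.copy()
    # Q_target = R + gamma * Q(s_{t+1}, a_{t+1}) as in (5); a_{t+1} taken greedily here.
    targets[np.arange(len(actions)), actions] = \
        rewards + GAMMA * q_next.max(axis=1) * (1.0 - dones)
    # The model is compiled with an MSE loss, so this step minimises L in (6).
    inferring_model.train_on_batch(states, targets)
```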

4 Evaluation

Experiments were carried out to evaluate the efficiency of the newly proposed policy “Zeta-Policy” and the improvement that can be gained by introducing the pre-training aspect to the RL paradigm.

4.1 Comparison of Policies

The key contribution presented in this article is the Zeta Policy. Therefore, we first compare its performance with that of three commonly used policies: the Epsilon-Greedy policy (EpsG.), the Max-Boltzmann Q Policy (MaxB.), and Linear Annealed wrapper on Epsilon-Greedy policy (LA). Figure 3 presents the results of these comparisons. Standard deviation and mean reward are plotted against the step number.

We notice from Figure 3 that the Zeta policy has a lower standard deviation of the reward, which suggests that the robustness of the selected actions (i. e., classified labels/emotions) of the Zeta policy is higher than that of the compared policies.

The Zeta policy outperforms the other compared policies with a higher mean reward. The mean reward of the Zeta policy converges to around 0.78 for the IEMOCAP dataset and 0.7 for the SAVEE dataset. These values are higher than those of the other compared policies, which means that the RL agent selects actions more correctly. Since the inferring DNN model of the RL agent is the same for all experiments, we can infer that the Zeta policy plays the major role in this out-performance.

Figure 6: Effect of the warm-up period and pre-training (p/t) on the performance.
Figure 7: Performance impact of pre-training (p/t) and cross-dataset pre-training with the IEMOCAP (IEM) and SAVEE (SAV) datasets.

4.1.1 Impact of the ϵ\epsilon value

All the compared policies, i.e., the Zeta policy, the Epsilon-Greedy policy, the Max-Boltzmann Q policy, and the Linear Annealed wrapper on the Epsilon-Greedy policy, use the parameter ϵ in their action selection process. ϵ decides between exploration and exploitation within the policy: 1 − ϵ is the probability of selecting exploitation over exploration. Hence, ϵ ranges between 0 and 1.

Several experiments were carried out over the ϵ values 0.1, 0.05, 0.025, and 0.0125. Figure 4 shows the effect of changing the ϵ value on the standard deviation of the reward and the mean reward for all policies.

The Zeta policy shows a noticeable change in the standard deviation of the reward and the mean reward for ϵ = 0.0125. The reason the Zeta policy performs well for lower ϵ is that, when cs < ζ, the Zeta policy picks actions with the exploration-exploitation strategy, where ϵ is the probability of exploration and a random action is drawn from a uniform distribution during exploration. A lower ϵ means less randomness in the period cs < ζ and thus a higher probability of selecting an action based on the Q-values, which helps the RL algorithm correct false predictions.

4.1.2 Impact of the number of steps

The Zeta policy uses the parameter Zeta_nb_step (ζ) to determine the action selection route, and the Linear Annealed wrapper uses the parameter nb_step to determine the gradient of the ϵ value used in its Epsilon-Greedy component. Experiments were defined to examine the behaviour of the RL agent when changing these parameters to 500 000, 250 000, 100 000, and 50 000; Figure 5 shows the results. Looking at the standard deviation curve of the Zeta policy, the robustness of the RL agent increases with the number of steps, with the most robust curve observed for a step count of 500 000.

4.2 Pre-training for the reduced warm-up period

The warm-up period is executed in the RL algorithm to collect experiences for the RL agent to sample during experience replay. With pre-training, however, the inferring DNN of the RL agent has already learnt the relevant features, which leads it to produce better Q-values than an RL agent whose DNN is initialised with random parameters. An experiment was executed to identify whether the warm-up period can be reduced after pre-training while keeping the RL agent's performance unchanged; Figure 6 shows the results. Observing both the standard deviation of the reward and the mean reward in Figure 6, pre-training improves the robustness and the quality of the predictions. The time taken to reach the highest performance is reduced since the warm-up period is reduced, which lowers the RL training time as well as the time needed to optimise the RL agent.

The speed-up of the training period achieved by pre-training was calculated by considering the number of steps needed to reach a mean reward of 0.6. The mean number of steps taken to reach a mean reward of 0.6 was 77 126 without pre-training and 48 623 with pre-training (warm-up of 10 000 steps). The speed-up of the training period from pre-training and reducing the warm-up steps was 1.63x.

4.3 Cross-Dataset pre-training

In our envisioned scenario, an agent, although pre-trained with one corpus, is expected to be robust to other corpora and dialects. To examine the behaviour of cross-dataset pre-training in RL, we pre-trained the RL agent for IEMOCAP with the SAVEE pre-training data subset and plotted the reward and the standard deviation of the reward in Figure 7. The graph shows that pre-training always improved the performance of the RL agent and that cross-dataset pre-training did not degrade the performance drastically. This practice can be used in real-world RL applications where little training data is available: the RL agent can be pre-trained with a dataset aligned with the problem and then deployed to act in the real environment.

4.4 Accuracy of the predictions

The accuracy of a machine learning model is a popular attribute used to benchmark it against comparable models. Since RL is a dynamic programming style method, it does not come with an accuracy attribute. However, as this study is focused on a classification problem, we calculated the accuracy of the RL agent from the environment's logs. Equation (8) is used to calculate the accuracy of an episode. The accuracy of the testing phase after an RL execution of 700 000 steps is tabulated in Table III.

accuracy = (No. of correct inferences / No. of utterances) × 100 %   (8)
TABLE III: Testing accuracy values of the two datasets IEMOCAP and SAVEE under each policy after 700 000 steps of RL training.
Policy \ Dataset | IEMOCAP | SAVEE
Zeta | 54.29 ± 2.50 | 68.90 ± 0.61
Max-Boltzmann | 51.92 ± 1.40 | 67.90 ± 0.40
Epsilon-Greedy | 51.45 ± 0.74 | 58.94 ± 2.85
Linear Annealed | 51.72 ± 0.75 | 62.20 ± 6.10


Table III shows that the Zeta policy outperforms the other compared policies on both datasets. These results can also be compared with those of supervised learning methods, even though the two are different machine learning paradigms.

5 Conclusion

This study investigated the feasibility of a novel reinforcement learning policy, the Zeta policy, for speech emotion recognition. Pre-training the RL agent was also studied as a way to reduce the training time and minimise the warm-up period. The results show that the proposed Zeta policy performs better than the existing policies. We also provided an analysis of the relevant parameters ϵ and the number of steps, which shows the operating range of these two parameters. The results further confirm that pre-training can reduce the time taken to reach maximum performance by reducing the warm-up period. We show that the proposed Zeta policy with pre-training is robust to a cross-corpus scenario. Future work should study a cross-language scenario and explore the feasibility of using the novel Zeta policy with other RL algorithms.

References

  • [1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016. [Online]. Available: https://doi.org/10.1038/nature16961
  • [2] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver, “Grandmaster level in StarCraft II using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019. [Online]. Available: https://www.nature.com/articles/s41586-019-1724-z
  • [3] Y. Shen, C. Huang, S. Wang, Y. Tsao, H. Wang, and T. Chi, “Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6750–6754.
  • [4] R. Fakoor, X. He, I. Tashev, and S. Zarar, “Reinforcement Learning To Adapt Speech Enhancement to Instantaneous Input Signal Quality,” arXiv:1711.10791 [cs], 2018. [Online]. Available: http://arxiv.org/abs/1711.10791
  • [5] H. Chung, H. B. Jeon, and J. G. Park, “Semi-supervised Training for Sequence-to-Sequence Speech Recognition Using Reinforcement Learning,” in Proceedings of the International Joint Conference on Neural Networks.   Institute of Electrical and Electronics Engineers Inc., 7 2020.
  • [6] S. P. Singh, M. J. Kearns, D. J. Litman, and M. A. Walker, “Reinforcement learning for spoken dialogue systems.” in Nips, 1999, pp. 956–962.
  • [7] T. Paek, “Reinforcement learning for spoken dialogue systems: Comparing strengths and weaknesses for practical deployment,” in Proc. Dialog-on-Dialog Workshop, Interspeech, 2006.
  • [8] E. Lakomkin, M. A. Zamani, C. Weber, S. Magg, and S. Wermter, “EmoRL: Continuous Acoustic Emotion Classification using Deep Reinforcement Learning,” Proceedings - IEEE International Conference on Robotics and Automation, pp. 4445–4450, 4 2018. [Online]. Available: http://arxiv.org/abs/1804.04053
  • [9] J. Fan, Z. Wang, Y. Xie, and Z. Yang, “A Theoretical Analysis of Deep Q-Learning,” arXiv, 1 2019. [Online]. Available: http://arxiv.org/abs/1901.00137
  • [10] G. V. d. l. Cruz Jr, Y. Du, and M. E. Taylor, “Pre-training Neural Networks with Human Demonstrations for Deep Reinforcement Learning,” in Adaptive Learning Agents (ALA), 2019. [Online]. Available: http://arxiv.org/abs/1709.04083
  • [11] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser et al., “Starcraft ii: A new challenge for reinforcement learning,” arXiv, vol. 2017, no. 1708.04782, 2017.
  • [12] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband et al., “Deep q-learning from demonstrations,” in Proceedings AAAI, 2018.
  • [13] V. Kurin, S. Nowozin, K. Hofmann, L. Beyer, and B. Leibe, “The atari grand challenge dataset,” arXiv, no. 1705.10998, 2017.
  • [14] S. Calinon, “Learning from demonstration (programming by demonstration),” Encyclopedia of Robotics, pp. 1–8, 2018.
  • [15] D. Yu, L. Deng, and G. Dahl, “Roles of pre-training and fine-tuning in context-dependent dbn-hmms for real-world speech recognition,” in Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
  • [16] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, “Deep neural network features and semi-supervised training for low resource speech recognition,” in 2013 IEEE international conference on acoustics, speech and signal processing.   IEEE, 2013, pp. 6704–6708.
  • [17] Y. Liu and K. Kirchhoff, “Graph-based semi-supervised acoustic modeling in dnn-based speech recognition,” in 2014 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2014, pp. 177–182.
  • [18] J. Stockholm and P. Pasquier, “Reinforcement Learning of Listener Response for Mood Classification of Audio,” in 2009 International Conference on Computational Science and Engineering, vol. 4, 2009, pp. 849–853.
  • [19] H. Yu and P. Yang, “An Emotion-Based Approach to Reinforcement Learning Reward Design,” in 2019 IEEE 16th International Conference on Networking, Sensing and Control (ICNSC), 2019, pp. 346–351.
  • [20] M. G. Lagoudakis and R. Parr, “Reinforcement learning as classification: leveraging modern classifiers,” ser. ICML’03.   AAAI Press, 2003, pp. 424–431.
  • [21] H. Han, K. Byun, and H. G. Kang, “A deep learning-based stress detection algorithm with speech signal,” in AVSU 2018 - Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, Co-located with MM 2018.   New York, NY, USA: Association for Computing Machinery, Inc, 10 2018, pp. 11–15. [Online]. Available: https://dl.acm.org/doi/10.1145/3264869.3264875
  • [22] S. Latif, R. Rana, and J. Qadir, “Adversarial Machine Learning And Speech Emotion Recognition: Utilizing Generative Adversarial Networks For Robustness,” arXiv, 11 2018. [Online]. Available: http://arxiv.org/abs/1811.11402
  • [23] R. Rana, S. Latif, R. Gururajan, A. Gray, G. Mackenzie, G. Humphris, and J. Dunn, “Automated screening for distress: A perspective for the future,” European Journal of Cancer Care, vol. 28, no. 4, 7 2019. [Online]. Available: https://pubmed.ncbi.nlm.nih.gov/30883964/
  • [24] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo, and A. Gruslys, “Deep Q-learning from Demonstrations,” arXiv:1704.03732 [cs], 2017. [Online]. Available: http://arxiv.org/abs/1704.03732
  • [25] C. J. C. H. Watkins, “Learning from Delayed Rewards,” Ph.D. dissertation, Cambridge, UK, 1989.
  • [26] M. Wiering, “Explorations in Efficient Reinforcement Learning,” Ph.D. dissertation, 1999. [Online]. Available: https://dare.uva.nl/search?identifier=6ac07651-85ee-4c7b-9cab-86eea5b818f4
  • [27] N. Cesa-Bianchi, C. Gentile, G. Lugosi, and G. Neu, “Boltzmann exploration done right,” ser. NIPS’17.   Curran Associates Inc., 2017, pp. 6287–6296.
  • [28] L. Pan, Q. Cai, Q. Meng, W. Chen, and L. Huang, “Reinforcement learning with dynamic boltzmann softmax updates,” in IJCAI International Joint Conference on Artificial Intelligence, vol. 2021-January.   International Joint Conferences on Artificial Intelligence, 7 2020, pp. 1992–1998. [Online]. Available: https://www.ijcai.org/proceedings/2020/276
  • [29] F. Leibfried and P. Vrancx, “Model-Based Regularization for Deep Reinforcement Learning with Transcoder Networks,” arXiv, 9 2018. [Online]. Available: http://arxiv.org/abs/1809.01906
  • [30] M. El Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011.
  • [31] D. Ververidis and C. Kotropoulos, “Emotional speech recognition: Resources, features, and methods,” Speech Communication, vol. 48, no. 9, pp. 1162–1181, 2006.
  • [32] B. Schuller, G. Rigoll, and M. Lang, “Hidden Markov model-based speech emotion recognition,” in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03)., vol. 2, 2003, pp. II–1.
  • [33] D. A. Cairns and J. H. L. Hansen, “Nonlinear analysis and classification of speech under stressed conditions,” The Journal of the Acoustical Society of America, vol. 96, no. 6, pp. 3392–3400, 1994. [Online]. Available: https://asa.scitation.org/doi/10.1121/1.410601
  • [34] C. Lee, S. S. Narayanan, and R. Pieraccini, “Combining acoustic and language information for emotion recognition,” 2002. [Online]. Available: https://www.semanticscholar.org/paper/Combining-acoustic-and-language-information-for-Lee-Narayanan/d5abf8adb874577dffc4038207b6b91bee0a3450
  • [35] S. Latif, R. Rana, S. Khalifa, R. Jurdak, and J. Epps, “Direct Modelling of Speech Emotion from Raw Speech,” 2019. [Online]. Available: http://arxiv.org/abs/1904.03833
  • [36] K. Han, D. Yu, and I. Tashev, “Speech Emotion Recognition Using Deep Neural Network and Extreme Learning Machine,” in Interspeech 2014, 9 2014.
  • [37] S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, “Cross Corpus Speech Emotion Classification - An Effective Transfer Learning Technique,” 2018.
  • [38] K. Y. Huang, C. H. Wu, Q. B. Hong, M. H. Su, and Y. H. Chen, “Speech Emotion Recognition Using Deep Neural Network Considering Verbal and Nonverbal Speech Sounds,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2019-May.   Institute of Electrical and Electronics Engineers Inc., 5 2019, pp. 5866–5870.
  • [39] R. Polikar, L. Upda, S. S. Upda, and V. Honavar, “Learn++: an incremental learning algorithm for supervised neural networks,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 31, no. 4, pp. 497–508, 2001.
  • [40] Q. L. Zhao, Y. H. Jiang, and M. Xu, “Incremental learning by heterogeneous bagging ensemble,” in International Conference on Advanced Data Mining and Applications.   Springer, 2010, pp. 1–12.
  • [41] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008. [Online]. Available: https://doi.org/10.1007/s10579-008-9076-6
  • [42] S. Haq, P. J. B. Jackson, and J. Edge, “Audio-visual feature selection and reduction for emotion classification,” 2008.
  • [43] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE transactions on acoustics, speech, and signal processing, vol. 28, no. 4, pp. 357–366, 1980.
  • [44] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python,” vol. 8, 2015.
  • [45] F. Chollet and others, Keras, 2015. [Online]. Available: https://keras.io
  • [46] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. [Online]. Available: https://www.tensorflow.org/
  • [47] S. Latif, R. Rana, S. Khalifa, R. Jurdak, J. Qadir, and B. W. Schuller, “Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends,” arXiv:2001.00378 [cs, eess], 2020. [Online]. Available: http://arxiv.org/abs/2001.00378
Thejan Rajapakshe received a bachelor's degree in Applied Sciences from Rajarata University of Sri Lanka, Mihintale, and a Bachelor of Information Technology from the University of Colombo School of Computing in 2016. He is currently a research scholar at the University of Southern Queensland (USQ). He is also an Associate Technical Team Lead at CodeGen International - Research & Development. His research interests include reinforcement learning, speech processing and deep learning.
Rajib Rana is an experimental computer scientist, Advance Queensland Research Fellow and a Senior Lecturer in the University of Southern Queensland. He is also the Director of IoT Health research program at the University of Southern Queensland. He is the recipient of the prestigious Young Tall Poppy QLD Award 2018 as one of Queensland’s most outstanding scientists for achievements in the area of scientific research and communication. Rana’s research work aims to capitalise on advancements in technology along with sophisticated information and data processing to better understand disease progression in chronic health conditions and develop predictive algorithms for chronic diseases, such as mental illness and cancer. His current research focus is on Unsupervised Representation Learning. He received his B.Sc. degree in Computer Science and Engineering from Khulna University, Bangladesh with Prime Minister and President’s Gold medal for outstanding achievements and Ph.D. in Computer Science and Engineering from the University of New South Wales, Sydney, Australia in 2011. He received his postdoctoral training at Autonomous Systems Laboratory, CSIRO before joining the University of Southern Queensland as Faculty in 2015.
Sara Khalifa is currently a senior research scientist at the Distributed Sensing Systems research group, Data61—CSIRO. She is also an honorary adjunct lecturer at University of Queensland and conjoint lecturer at University of New South Wales. Her research interests revolve around the broad aspects of mobile and ubiquitous computing, mobile sensing and Internet of Things (IoT). She obtained a PhD in Computer Science and Engineering from UNSW (Sydney, Australia). Her PhD dissertation received the 2017 John Makepeace Bennett Award which is awarded by CORE (Computing Research and Education Association of Australasia) to the best PhD dissertation of the year within Australia and New Zealand in the field of Computer Science. Her research has been recognised by multiple iAwards including 2017 NSW Mobility Innovation of the year, 2017 NSW R&D Innovation of the year, National Merit R&D Innovation of the year, and the Merit R&D award at the Asia Pacific ICT Alliance (APICTA) Awards, commonly known as the "Oscar" of the ICT industry in the Asia Pacific, among others.
Björn W. Schuller received his diploma in 1999, his doctoral degree for his study on Automatic Speech and Emotion Recognition in 2006, and his habilitation and Adjunct Teaching Professorship in the subject area of Signal Processing and Machine Intelligence in 2012, all in electrical engineering and information technology from TUM in Munich/Germany. He is Professor of Artificial Intelligence in the Department of Computing at the Imperial College London/UK, where he heads GLAM — the Group on Language, Audio & Music, Full Professor and head of the ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg/Germany, and CEO of audEERING. He was previously full professor and head of the Chair of Complex and Intelligent Systems at the University of Passau/Germany. Professor Schuller is Fellow of the IEEE, Golden Core Member of the IEEE Computer Society, Senior Member of the ACM, President-emeritus of the Association for the Advancement of Affective Computing (AAAC), and was elected member of the IEEE Speech and Language Processing Technical Committee. He (co-)authored 5 books and more than 800 publications in peer-reviewed books, journals, and conference proceedings leading to more than overall 25 000 citations (h-index = 73). Schuller is general chair of ACII 2019, co-Program Chair of Interspeech 2019 and ICMI 2019, repeated Area Chair of ICASSP, and former Editor in Chief of the IEEE Transactions on Affective Computing next to a multitude of further Associate and Guest Editor roles and functions in Technical and Organisational Committees.
Jiajun Liu is a science leader at CSIRO, Australia. His current research interests include multimedia content analysis, indexing, and retrieval. He received the B.E. degree from Nanjing University, China, and the Ph.D. degree from the University of Queensland, Brisbane, QLD, Australia, in 2013. He was also a Researcher/Software Engineer for IBM, China, from 2006 to 2008.