
A Novel Policy for Pre-trained
Deep Reinforcement Learning for
Speech Emotion Recognition

Thejan Rajapakshe¹,², Rajib Rana², Sara Khalifa³, Björn W. Schuller⁴,⁶, Jiajun Liu³
²University of Southern Queensland, Australia
³Distributed Sensing Systems Group, Data61, CSIRO Australia
⁴GLAM – Group on Language, Audio & Music, Imperial College London, UK
⁶ZD.B Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, Germany
¹[email protected]
Abstract

Reinforcement Learning (RL) is a semi-supervised learning paradigm where an agent learns by interacting with an environment. Combining deep learning with RL provides an efficient method to learn how to interact with the environment, called Deep Reinforcement Learning (deep RL). Deep RL has achieved tremendous success in gaming, such as with AlphaGo, but its potential has rarely been explored for challenging tasks like Speech Emotion Recognition (SER). Applying deep RL to SER could, for example, improve the performance of an automated call centre agent by dynamically learning emotion-aware responses to customer queries. While the policy employed by the RL agent plays a major role in action selection, there is currently no RL policy tailored to SER. In addition, an extended learning period is a general challenge for deep RL, which can slow the learning for SER. Therefore, in this paper, we introduce a novel policy, the "Zeta policy", which is tailored to SER, and apply pre-training in deep RL to achieve a faster learning rate. Pre-training with a cross dataset was also studied to assess the feasibility of pre-training the RL agent with a similar dataset in a scenario where real environmental data are not available. The IEMOCAP and SAVEE datasets were used for the evaluation, with the task being to recognise the four emotions happy, sad, angry, and neutral in the given utterances. The experimental results show that the proposed "Zeta policy" performs better than existing policies. They also confirm that pre-training can reduce the training time and that the approach is robust to a cross-corpus scenario.

Index Terms:
Machine Learning, Deep Reinforcement Learning, Speech Emotion Recognition

1 Introduction

Reinforcement Learning (RL), a semi-supervised machine learning technique, allows an agent to take actions and interact with an environment so as to maximise the total reward. Much of RL's popularity stems from its super-human performance in games, achieved by systems such as AlphaGo [1] and AlphaStar [2].

RL has also been employed for audio-based applications, showing its potential for audio enhancement [3, 4], automatic speech recognition [5], and spoken dialogue systems [6, 7]. The potential of using RL for speech emotion recognition has recently been demonstrated for robot applications, where the robot can detect an unsafe situation earlier given a human utterance [8]. An emotion detection agent was trained to achieve an accuracy-latency trade-off by penalising wrong classifications as well as late predictions through the reward function. Motivated by this recent study, we employ RL for speech emotion recognition, where a potential application of the proposed system could be an intelligent call centre agent learning over time how to communicate with human customers in an emotionally intelligent way. We consider the speech utterance as the "state" and the classified emotion as the "action". A correct classification yields a positive reward and an incorrect one a negative reward.

RL widely uses Q-learning, a simple yet quite powerful algorithm that creates a state-action mapping, namely the Q-table, for the agent. This, however, is intractable for a large or continuous state and/or action space. First, the amount of memory required to save and update the table grows with the number of states. Second, the amount of time required to explore each state to create the required Q-table would be unrealistic. To address these issues, the deep Q-Learning algorithm uses a neural network to approximate the Q-value function. Deep Q-Learning is an important algorithm enabling deep Reinforcement Learning (deep RL) [9].

The standard deep Q-learning algorithm employs a stochastic exploration strategy called ϵ-greedy, which follows a greedy policy according to the current Q-value estimate and chooses a random action with probability ϵ. Since the application of RL in speech emotion recognition (SER) is largely unexplored, there is not enough evidence whether the ϵ-greedy policy is best suited for SER. In this article, we investigate the feasibility of a policy tailored to SER. We propose the Zeta policy (code: https://github.com/jayaneetha/Zeta-Policy) and provide analysis supporting its superior performance compared to ϵ-greedy and some other popular policies.

A major challenge of deep RL is that it often requires a prohibitively large amount of training time and data to reach a reasonable accuracy, making it inapplicable in real-world settings [10]. Leveraging humans to provide demonstrations (known as learning from demonstration (LfD)) in RL has recently gained traction as a possible way of speeding up deep RL [11, 12, 13]. In LfD, actions demonstrated by the human are considered as the ground truth labels for a given input game/image frame. An agent closely simulates the demonstrator’s policy at the start, and later on, learns to surpass the demonstrator [10]. However, LfD holds a distinct challenge, in the sense that it often requires the agent to acquire skills from only a few demonstrations and interactions due to the time and expense of acquiring them [14]. Therefore, LfDs are generally not scalable, especially for high-dimensional problems.

We propose the technique of pre-training the underlying deep neural networks to speed up training in deep RL. It enables the RL agent to learn better features leading to better performance without changing the policy learning strategies [10]. In supervised methods, pre-training helps regularisation and enables faster convergence compared to randomly initialised networks [15]. Various studies (e. g., [16, 17]) have explored pre-training in speech recognition and achieved improved results. However, pre-training in deep RL is hardly explored in the area of speech emotion recognition. In this paper, we present the analysis showing that pre-training can reduce the training time. In our envisioned scenario, the agent might be trained with one corpus but might need to interact with other corpora. To test the performance in those scenarios, we also analyse performance for cross-corpus pre-training.

2 Related Work and Background

Reinforcement learning has not been widely explored for speech emotion recognition. The closest match to our work is EmoRL, where the authors use deep RL to determine the best position at which to split an utterance before sending it to the emotion recognition model [8]. The RL agent decides on an "action": whether to wait for more audio data or to terminate and trigger prediction. Once the terminate action is selected, the agent stops processing the audio stream and starts classifying the emotion. The authors aim to achieve a trade-off between accuracy and latency by penalising wrong classifications as well as delayed predictions through rewards. In contrast, our focus is on developing a new policy tailored to SER and applying pre-training to achieve a faster learning rate.

Another related study used RL to develop a music recommendation system based on the mood of the listener. The users select their current mood in terms of "pleasure" and "energy", after which the application selects a song from its repository. The users can provide feedback on the recommended audio by answering the question "Does this audio match the mood you set?" [18]. Here, the key focus is to learn the mapping from a song to the selected mood, whereas in this article we focus on the automatic determination of the emotion. Yu and Yang introduced an emotion-based target reward function [19], which again did not focus on SER.

A group of studies used RL for audio enhancement. Shen et al. showed that using RL to enhance audio utterances can reduce the character error rate of automatic speech recognition under noise by about 7 % in the testing phase [3]. Fakoor et al. also used RL for speech enhancement, treating the noise suppressor as a black box and only taking feedback from its output as the reward; they achieved a 42 % increase in the signal-to-noise ratio of the output [4].

On the topic of proposing new policies, Lagoudakis and Parr used a modification of approximate policy iteration for pendulum balancing and bicycle riding domains [20]. We propose a policy which is tailored to SER. Many studies in the literature use deep learning to capture emotion and related features from speech signals; however, none of them focus on RL [21, 22, 23].

An important aspect of our study is incorporating pre-training in RL to achieve a faster learning rate for SER. De la Cruz et al. used human demonstrations of playing Atari games to pre-train a deep RL agent and improve its training time [10]. Hester et al., in their introduction of Deep Q-Learning from Demonstrations, presented prior recorded demonstrations to a deep RL system playing 42 games; 41 of them showed improved performance and 11 of them achieved state-of-the-art performance [24]. A clear advance in performance and accuracy is also observed in the results of their case study with the Speech Commands Dataset. None of these studies focuses on using pre-training in RL for SER, which is our focus.

2.1 Reinforcement Learning

An RL architecture mainly consists of two major components, the "Environment" and the "Agent", with three major signals passing between them: the current state, the reward, and the next state. The agent interacts with an unknown environment and observes a state s_t ∈ S at every time step t ∈ [0, T], where S is the state space and T is the terminal time. The agent selects an action a ∈ A, where A is the action space; the environment then transitions to s_{t+1} and the agent receives a scalar reward r_{t+1}, which is feedback on how good the selected action was. The agent learns a policy π that maps a given state s to the most suitable action a. The objective of RL is to learn π*, an optimal policy that maximises the cumulative reward.
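To make the interaction loop above concrete, the following is a minimal sketch in Python; the env, agent, and policy objects and their method names are illustrative assumptions, not part of the paper's implementation.

```python
# A minimal sketch of the agent-environment loop described above. The
# `env` and `policy` objects are illustrative assumptions; `env.reset()`
# returns a state s_t and `env.step(a)` returns (s_{t+1}, r_{t+1}, done).
def run_episode(env, agent, policy):
    state = env.reset()                               # observe s_0
    total_reward, done = 0.0, False
    while not done:                                   # until the terminal time T
        q_values = agent.q_values(state)              # agent's estimate of Q(s_t, .)
        action = policy.select_action(q_values)       # a_t chosen by the policy
        next_state, reward, done = env.step(action)   # environment feedback r_{t+1}
        agent.remember(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
    return total_reward
```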

2.1.1 Q-Learning

Before introducing Q-Learning, we first introduce a number of key terms.

Model-based and model-free methods: A model-based method learns the transition probabilities from the current state to the next state for each action. Although straightforward, model-based algorithms become impractical as the state and action spaces grow. In contrast, model-free algorithms rely on trial and error to update their knowledge and do not require storing all combinations of states and actions.

On-policy and off-policy learning: These can be described in terms of the target policy and the behaviour policy. The target policy is the policy that the agent tries to learn by learning a value function. The behaviour policy is the policy the agent uses for action selection, in other words, to interact with the environment.

In on-policy learning, the target and behaviour policies are the same, whereas they differ in off-policy learning. Off-policy learning enables continuous exploration and can learn an optimal policy, whereas on-policy learning can only learn a sub-optimal policy.

In this paper, we consider the widely used Q-Learning, a model-free, off-policy learning algorithm. Q-Learning maintains a Q-table which contains a Q-value for each state-action combination. This Q-table is updated each time the agent receives a reward from the environment. A straightforward way to store this information is a table such as Table I.

TABLE I: Fundamental Structure of the Q-Table
state | action | Q(state, action)

Q-values from the Q-table are used for the action selection process. A policy takes Q-values as input and outputs the action to be selected by the agent. Some of these policies are the Epsilon-Greedy policy [25], the Boltzmann Q policy, the Max-Boltzmann Q policy [26], and the Boltzmann-Gumbel exploration policy [27].

The main disadvantage of Q-Learning is that the state space must be discrete: the Q-table cannot store Q-values for a continuous state space. Another disadvantage is that the Q-table grows with the state space and at some point becomes unmanageable. This is known as the "curse of dimensionality" [26], and the complexity of evaluating a policy scales as O(n³), where n is the number of states. Deep Q-Learning offers a solution to these challenges.

2.1.2 Deep Q-Learning

A neural network can be used to approximate the Q-values with the state as input. This is more tractable than storing every possible state in a table like Table I.

Q = neural_network.predict(state)   (1)

As problems and states become more complex, the neural network may need to be "deep", meaning a few hidden layers may not suffice to capture all the intricate details, hence the use of Deep Neural Networks (DNNs). Studies have incorporated DNNs into Q-Learning by replacing the Q-table with a DNN, known as deep Q-Learning, and the combination of deep learning with reinforcement learning is now known as Deep Reinforcement Learning.

The deep Q-Learning process uses two neural network models: the inferring model and the target model. These networks have the same architecture but different weights. Every N steps, the weights of the inferring network are copied to the target network. Using both of these networks leads to more stability in the learning process and helps the algorithm learn more effectively.
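As a rough illustration of this target network mechanism in Keras (the interval value follows the N_copy = 10 000 setting reported in Section 3.4.3; the surrounding training loop is assumed):

```python
# Illustrative sketch: periodically syncing the target network with the
# inferring network in Keras. N_COPY is the interval in steps.
N_COPY = 10_000

def maybe_sync_target(step, inferring_model, target_model):
    if step % N_COPY == 0:
        # Both models share the same architecture, so the weight lists align.
        target_model.set_weights(inferring_model.get_weights())
```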

2.1.3 Warm-up period

Since the RL agent learns only from interacting with the environment and the reward gained, it needs a set of experiences before training with experience replay can start. A parameter nb_warm_up defines the number of warm-up steps to be performed before RL training begins. During this period, actions are selected entirely at random for a given state and no Q-values are involved. The state, action, reward, and next state are stored in the memory buffer and later sampled for experience replay.
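A minimal sketch of this warm-up phase, assuming the illustrative environment interface from the earlier sketch and a simple list-based replay memory:

```python
import random

def warm_up(env, memory, nb_warm_up, n_actions):
    """Fill the replay memory with experiences from purely random actions."""
    state = env.reset()
    for _ in range(nb_warm_up):
        action = random.randrange(n_actions)            # no Q-values involved yet
        next_state, reward, done = env.step(action)
        memory.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
```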

2.1.4 Q-Learning Policies

Q-Learning policies are responsible for deciding the best action to be selected based on the Q-values given as their input. Exploration and exploitation are used to improve the experiences collected for experience replay by adding randomness to the action selection. The Epsilon-Greedy policy, the Max-Boltzmann Q policy, and the Linear Annealed wrapper on the Epsilon-Greedy policy are some of the Q-Learning policies used today [28, 29].

The Epsilon-Greedy policy adopts a greedy algorithm to select an action from the Q-values. The ϵ value of this policy determines the exploitation-exploration ratio: 1 − ϵ is the probability of choosing exploitation during action selection. During exploration, a random action is drawn from a uniform random distribution, while during exploitation the action with the maximum Q-value is selected. The Linear Annealed wrapper on the Epsilon-Greedy policy changes the ϵ value of the Epsilon-Greedy policy at each step: given an ϵ range and a number of steps as parameters, the wrapper linearly updates ϵ at each step.
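The following sketch illustrates Epsilon-Greedy action selection and the linear annealing of ϵ as described above; the function names and the annealing direction (from a larger to a smaller ϵ) are assumptions.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Exploit (argmax Q) with probability 1 - epsilon, else explore uniformly."""
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))   # exploration: uniform random action
    return int(np.argmax(q_values))               # exploitation: greedy action

def linear_annealed_epsilon(step, eps_max, eps_min, nb_steps):
    """Linearly anneal epsilon from eps_max to eps_min over nb_steps steps."""
    frac = min(step / nb_steps, 1.0)
    return eps_max + frac * (eps_min - eps_max)
```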

The Max-Boltzmann policy also uses ϵ as a parameter, which is considered when deciding between exploitation and exploration. Exploration in the Max-Boltzmann policy is similar to that of the Epsilon-Greedy policy. During exploitation, instead of selecting the action with the maximum Q-value as the Epsilon-Greedy policy does, the Max-Boltzmann policy randomly selects an action from a distribution shaped by the Q-values. This introduces more randomness while still using the Q-values in the action selection process.
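A sketch of Max-Boltzmann action selection following the description above (uniform random exploration with probability ϵ, Boltzmann sampling over the Q-values otherwise); the softmax form and the temperature parameter tau are assumptions.

```python
import numpy as np

def max_boltzmann(q_values, epsilon, tau=1.0):
    """With probability epsilon explore uniformly; otherwise sample an action
    from a Boltzmann (softmax) distribution over the Q-values."""
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))           # uniform random exploration
    # Softmax with temperature tau; subtract the max for numerical stability.
    z = np.exp((np.asarray(q_values) - np.max(q_values)) / tau)
    probs = z / z.sum()
    return int(np.random.choice(len(q_values), p=probs))  # Q-value-shaped sampling
```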

2.2 Scope of RL in Speech Emotion Recognition

Emotion recognition from speech has gained a fair amount of attention over the past years among machine learning researchers, and many studies have been carried out to improve the performance of SER at both the feature extraction and emotion classification stages [30]. Hidden Markov Models, Support Vector Machines, Gaussian Mixture Models, and Artificial Neural Networks are some of the classifiers used for SER in the literature [31, 32, 33, 34]. With the tremendous success of DNN architectures, numerous studies have used them successfully and achieved good performance, e.g., [35, 36, 37, 38].

With DNNs, supervised and guided unsupervised/self-supervised techniques dominate SER development; however, there is still a gap in the literature for dynamically updating SER systems. Although some studies, e. g., Learn++ [39] and Bagging++ [40], use incremental/online learning, there is a major difference between online learning and RL. Online learning is usually used for a constant stream of data, where each example is discarded once used. RL, in contrast, builds a series of state-action pairs that draw either a positive or a negative reward: a positive reward reinforces the entire string of actions leading up to it, whereas a negative reward penalises those actions.

Figure 1: Flow of the Zeta Policy and the connection with related components in the RL Architecture.

3 Methodology

3.1 Model Process Flow

Figure 1 shows our proposed architecture and process flow. It focuses mainly on the process flow of the "Zeta Policy" while also showing the interconnection between the inferring and target models and the interaction between the agent and the environment through action and reward. Throughout this section, we gradually elaborate on the different components.

The inferring model approximates the Q-values used in action selection for a given state provided by the environment. After the agent receives a reward signal for an action executed in the environment, the attributes (state, reward, action, next state, and terminal signal) are stored in the experience memory (History DB in Figure 1). Every N_update steps, a batch of samples from the experience memory is used to update the parameters of the inferring model. The target model approximates the Q_target values; the use of Q_target is explained in Section 3.4.4. The parameters of the target model are updated every N_copy steps. N_update and N_copy are hyper-parameters.

3.2 Zeta Policy

This study introduces a novel RL policy, the "Zeta Policy", which takes inspiration from the Boltzmann Q policy and the Max-Boltzmann Q policy. The policy uses the current step cs of the RL training cycle (see Figure 1) to decide how action selection is performed. Figure 1 shows how the action selection process of the Zeta policy connects with the other components of the RL architecture. The parameter Zeta_nb_step (ζ) is a hyper-parameter of the policy and determines the action selection route. If cs < ζ, the policy follows an exploitation-exploration process, where exploitation selects the action with the highest Q-value and exploration selects a random action from a discrete uniform distribution to add uniform randomness to the experiences; the parameter ϵ is compared with a random number n drawn from a uniform distribution to choose between exploration and exploitation. If cs > ζ, an action is sampled from a distribution shaped by the Q-values. Experiments were carried out to find the effect of the parameters ζ and ϵ on the performance of RL. A sketch of this action selection logic is given below.
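As the forward reference above indicates, here is a minimal sketch of the Zeta policy's action selection; interpreting "a distribution shaped by the Q-values" as a softmax over the Q-values, and the temperature tau, are assumptions, and the released code should be treated as authoritative.

```python
import numpy as np

def zeta_policy_action(q_values, current_step, zeta_nb_step, epsilon, tau=1.0):
    """Illustrative sketch of Zeta policy action selection as described above."""
    q_values = np.asarray(q_values)
    if current_step < zeta_nb_step:
        # Early phase (cs < zeta): epsilon-greedy style exploration / exploitation.
        if np.random.random() < epsilon:
            return np.random.randint(len(q_values))   # uniform random exploration
        return int(np.argmax(q_values))               # greedy exploitation
    # Later phase (cs >= zeta): sample an action from a Q-value-shaped
    # (softmax) distribution rather than acting greedily.
    z = np.exp((q_values - np.max(q_values)) / tau)
    probs = z / z.sum()
    return int(np.random.choice(len(q_values), p=probs))
```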

In the SER context, a state s is an utterance from the dataset and an action a is the label (emotion) assigned to that state. A reward r is returned by the environment after comparing the ground truth with the action a: if the action (classified emotion) matches the ground truth, i. e., the agent has inferred the correct label for the given state, the reward is 1; otherwise it is −1. The reward is accumulated throughout the episode and the mean reward is calculated; the higher the mean reward, the better the performance of the RL agent. The standard deviation of the reward is also calculated, since this value indicates how robust the RL predictions are.
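A minimal sketch of the SER environment implied by this formulation, with states as utterance features, actions as emotion labels, and a ±1 reward; the class interface and dataset handling are illustrative assumptions.

```python
class SEREnvironment:
    """States are utterances (their features), actions are emotion labels,
    and the reward is +1 for a correct label and -1 otherwise."""
    def __init__(self, features, labels):
        self.features, self.labels = features, labels
        self.idx = 0

    def reset(self):
        self.idx = 0
        return self.features[self.idx]

    def step(self, action):
        reward = 1 if action == self.labels[self.idx] else -1
        self.idx += 1
        done = self.idx >= len(self.features)
        next_state = None if done else self.features[self.idx]
        return next_state, reward, done
```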

3.3 Pre-training for improved performance

Pre-training allows the RL agent's inferring DNN to optimise its parameters and learn the features required for a similar problem. To use a pre-trained DNN in RL, we replace its softmax layer with a Q-value output layer. As the inferring model starts from learnt features, a long warm-up period to collect experiences for replay is no longer necessary. Indeed, extended training time is a key shortcoming of RL [10]; one key contribution of this paper is to use pre-training to reduce the training time by shortening the warm-up period.

Figure 2: Deep Reinforcement Learning Architecture

3.4 Experimental Setup

In this section, we explain the experimental setup, including feature extraction and model configuration. Figure 2 shows the architecture of the environment, agent, deep neural network, and policy used in this study.

3.4.1 Dataset

TABLE II: Distribution of utterances in the two considered datasets by emotion
Emotion IEMOCAP SAVEE
Happy 895 60
Sad 1200 60
Angry 1200 60
Neutral 1200 120

This study uses two popular datasets in SER: IEMOCAP [41] and SAVEE [42]. The IEMOCAP dataset consists of five sessions; each session includes speech segments from two speakers and is labelled with nine emotional categories. However, we use only happiness, sadness, anger, and neutral for consistency with the literature. The dataset was collected from ten speakers (five male and five female). We took a maximum of 1 200 segments from each emotion to balance the distribution of emotions within the dataset.

Note that the SAVEE dataset is relatively small compared to IEMOCAP. It was collected from 4 male speakers and has 8 emotion labels, which we filtered to keep the happiness, sadness, anger, and neutral segments for alignment with IEMOCAP and the literature.

Table II shows the distribution of utterances in the two datasets; 30 % of each dataset is used as a subset for pre-training and the rest for the RL execution.

20 % of the RL execution data subset is used in the testing phase. The RL testing phase executes the whole RL pipeline but does not update the model parameters. RL, being a different machine learning paradigm from supervised learning, does not normally need a testing dataset, as the agent continuously interacts with an environment. However, since this study addresses a classification problem, we included a testing phase by providing data that the RL agent has not seen before.

We remove the silent sections of the audio segments and process only the initial two seconds of the audio. Segments shorter than two seconds are zero-padded. We chose two seconds because a longer segment length would produce more heavily zero-padded segments, and the padding would become an identifiable feature within the dataset during model training.

3.4.2 Feature Extraction

We use Mel-Frequency Cepstral Coefficients (MFCCs) to represent the speech utterances. MFCCs are widely used in speech audio analysis [35, 43]. We extract 40 MFCCs from the Mel-spectrograms with a frame length of 2 048 and a hop length of 512 using Librosa [44], as sketched below.
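The preprocessing (Section 3.4.1) and feature extraction steps can be sketched as follows; the sample rate of 22 050 Hz and the default trimming threshold are assumptions, but they reproduce the 40 × 87 input shape used in Section 3.4.3.

```python
import numpy as np
import librosa

def extract_mfcc(path, sr=22050, duration=2.0, n_mfcc=40):
    """Trim silence, keep/pad to the first two seconds, then compute 40 MFCCs
    with a frame length of 2048 and a hop length of 512."""
    y, sr = librosa.load(path, sr=sr)
    y, _ = librosa.effects.trim(y)                 # remove leading/trailing silence
    n_samples = int(sr * duration)
    if len(y) < n_samples:
        y = np.pad(y, (0, n_samples - len(y)))     # zero-pad short utterances
    else:
        y = y[:n_samples]                          # keep only the first two seconds
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=2048, hop_length=512)
    return mfcc                                    # shape (40, 87) at 22 050 Hz
```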

Figure 3: Comparison across the Zeta policy (Zeta), the Epsilon-Greedy policy (EpsG.), the Max-Boltzmann Q-policy (MaxB.), and the Linear Annealed wrapper on the Epsilon-Greedy policy (LA) for the two datasets IEMOCAP (IEM) and SAVEE (SAV).
Figure 4: Comparison of mean reward and standard deviation of reward across policies when changing the ϵ value of the Zeta policy (Zeta), the Epsilon-Greedy policy (EpsG.), the Max-Boltzmann Q-policy (MaxB.), and the Linear Annealed wrapper on the Epsilon-Greedy policy (LA).
Figure 5: Comparison of mean reward and standard deviation of reward when changing the number of steps of the Zeta policy (Zeta) and the Linear Annealed wrapper on the Epsilon-Greedy policy (LA).

3.4.3 Model Recipe

A supervised learning approach was followed to identify the best-suited DNN architecture for the inferring model. Even though the last layer of a supervised learning model outputs a probability vector, RL learns representations in the hidden layers through a similar mechanism. The DNN architecture of the supervised learning model is identical to that of the inferring model except for the output layer: the activation function of the supervised model's output layer is softmax, whereas in RL it is a linear activation. Different architectures containing Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs), and dense layers were evaluated on a similar dataset, and the architecture with the highest testing accuracy was selected as the inferring DNN architecture.

We use the popular Deep Learning API Keras [45] with TensorFlow [46] as the backend for modelling and training purposes in this study. We model the RL agent’s DNN with a combination of CNNs and LSTM. The use of a CNN-LSTM combined model is motivated by the ability to learn temporal and frequency components in the speech signal [47]. We stack the LSTM on a CNN layer and pass it on to a set of fully connected layers to learn discriminative features [35].

As shown in the Deep Neural Network component in Figure 2, the initial layers of the inferring DNN model are two 2D convolutional layers with filter sizes 5 and 3, respectively. Batch normalisation is applied after the first 2D convolutional layer. The output of the second 2D convolutional layer is passed to an LSTM layer of 16 cells and then to a fully connected layer of 265 units. A dropout layer with rate 0.3 is applied before the dense layer that outputs the Q-values. The number of outputs in this dense layer equals the number of actions, i. e., the number of classes in the classification problem. The activation function of the last layer is kept linear, since the output Q-values should not be normalised. The input shape of the model is 40 × 87, where 40 is the number of MFCCs and 87 is the number of frames in the MFCC spectrum. An Adam optimiser with a learning rate of 0.00025 is used to compile the model. A sketch of this architecture is given below.
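The following is a sketch of this inferring DNN in Keras; the number of convolutional filters, the ReLU activations, the axis handling between the CNN and the LSTM, and the MSE loss are assumptions not stated in the text.

```python
# A minimal sketch of the inferring DNN described above; unstated details are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_inferring_model(n_mfcc=40, n_frames=87, n_actions=4):
    inputs = layers.Input(shape=(n_mfcc, n_frames, 1))               # 40 x 87 MFCC input
    x = layers.Conv2D(32, kernel_size=5, activation='relu')(inputs)  # filter count assumed
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(32, kernel_size=3, activation='relu')(x)       # filter count assumed
    # Move the time axis first and flatten the rest so the LSTM sees a sequence.
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.LSTM(16)(x)                                           # 16 LSTM cells
    x = layers.Dense(265, activation='relu')(x)                      # fully connected layer
    x = layers.Dropout(0.3)(x)
    q_values = layers.Dense(n_actions, activation='linear')(x)       # unnormalised Q-values
    model = models.Model(inputs, q_values)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00025),
                  loss='mse')                                        # loss choice assumed
    return model
```

The same builder would be used for both the inferring and the target model, which share this architecture.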

The output layer of the neural network model returns an array of Q-values, one per action, for a given input state. All the Q-learning policies used in this paper (Zeta, Epsilon-Greedy, Max-Boltzmann Q policy, and the Linear Annealed wrapper on Epsilon-Greedy) then use these Q-values for action selection. We set the number of steps between inferring model updates (N_update) to 4 and the number of steps between parameter copies from the inferring model to the target model (N_copy) to 10 000.

Supervised learning is used to pre-train the inferring model. A DNN with the same architecture as the inferring model, but with a softmax activation function at the output layer, is trained with the pre-training data subset before starting the RL execution. The model is trained for 64 epochs with a batch size of 128. Once pre-training is completed, the parameters of the pre-trained DNN are copied to the inferring and target models of the RL agent, as sketched below.
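A sketch of this pre-training step, reusing build_inferring_model from the previous sketch; attaching the softmax head directly to the Q-network's own feature layers (so they share weights) is one possible implementation and is an assumption, as is the cross-entropy loss.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def pretrain(q_model, x_pretrain, y_pretrain, n_classes=4):
    """Train a softmax classifier on the pre-training subset; the feature
    layers are shared with q_model, so they end up pre-trained while the
    linear Q-value head stays randomly initialised."""
    features = q_model.layers[-2].output                   # output of the dropout layer
    probs = layers.Dense(n_classes, activation='softmax')(features)
    clf = models.Model(q_model.input, probs)               # shares feature layers with q_model
    clf.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00025),
                loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    clf.fit(x_pretrain, y_pretrain, epochs=64, batch_size=128)
    return q_model

# After pre-training, the target model is initialised from the inferring model:
# target_model.set_weights(q_model.get_weights())
```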

3.4.4 Model parameter update

The mathematical formulation of Q-Learning is based on Bellman's equation for the state-value function, Eq. (2), where v(s) is the value of state s, R_t is the reward at time t, and γ is the discount factor.

v(s) = E[R_{t+1} + γ v(s_{t+1}) | S_t = s]   (2)

Equation (2) can be rewritten to obtain Bellman's state-action value function, known as the Q-function:

q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]   (3)

Here, q_π(s, a) is the Q-value of state s when taking action a under the policy π.

We use a DNN to approximate the Q-values, with q_π(s, a) taken as the target Q-value, giving the loss function (4):

L = Σ (Q_target − Q)²   (4)
Q_target = R(s_{t+1}, a_{t+1}) + γ Q(s_{t+1}, a_{t+1})   (5)

Combining Equations (4) and (5), the loss function can be written as:

L = Σ (R(s_{t+1}, a_{t+1}) + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))²   (6)

Minimising the loss L is the optimisation problem solved in the training phase.

Q_target is obtained by inferring the state s through the target network via function (1). The output layer of the DNN model is a dense layer with a linear activation function. Following the feed-forward pass of a deep neural network to the last layer l, the Q-value for an action a is obtained by Equation (7),

Q_a = W^l x^{l−1} + b^l   (7)

where x^{l−1} is the input to layer l. The backpropagation pass during parameter optimisation updates the W^l and b^l values of each layer l to minimise the loss L.
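Putting Equations (4)-(6) together, the parameter update over a replay batch can be sketched as follows; the discount factor value and the greedy choice of the next action a_{t+1} (a max over the target network's Q-values, as in standard deep Q-learning) are assumptions, and the two Keras models follow the earlier sketches.

```python
import numpy as np

GAMMA = 0.99  # discount factor gamma (value assumed; not stated in the text)

def train_on_replay_batch(inferring_model, target_model, batch):
    """One parameter update following equations (4)-(6) over a replay batch.
    `batch` holds numpy arrays; `dones` is 1.0 for terminal transitions."""
    states, actions, rewards, next_states, dones = batch
    q_values = inferring_model.predict(states, verbose=0)      # Q(s_t, .)
    q_next = target_model.predict(next_states, verbose=0)      # Q_target(s_{t+1}, .)
    targets = q_values.copy()
    # Q_target = R + gamma * Q(s_{t+1}, a_{t+1}) as in (5); a_{t+1} taken greedily here.
    targets[np.arange(len(actions)), actions] = \
        rewards + GAMMA * q_next.max(axis=1) * (1.0 - dones)
    # The model is compiled with an MSE loss, so this step minimises L in (6).
    inferring_model.train_on_batch(states, targets)
```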

4 Evaluation

Experiments were carried out to evaluate the efficiency of the newly proposed policy “Zeta-Policy” and the improvement that can be gained by introducing the pre-training aspect to the RL paradigm.

4.1 Comparison of Policies

The key contribution presented in this article is the Zeta Policy. Therefore, we first compare its performance with that of three commonly used policies: the Epsilon-Greedy policy (EpsG.), the Max-Boltzmann Q Policy (MaxB.), and Linear Annealed wrapper on Epsilon-Greedy policy (LA). Figure 3 presents the results of these comparisons. Standard deviation and mean reward are plotted against the step number.

We notice from Figure 3 that the Zeta policy has a lower standard deviation of the reward, which suggests that the robustness of the selected actions (i. e., classified labels/emotions) of the Zeta policy is higher than that of the compared policies.

The Zeta policy outperforms the other compared policies with a higher mean reward. The mean reward of the Zeta policy converges to around 0.78 for the IEMOCAP dataset and 0.7 for the SAVEE dataset. These values are higher than those of the other compared policies, which means that the RL agent selects actions more correctly. Since the inferring DNN model of the RL agent is the same for all experiments, we can infer that the Zeta policy plays the major role in this out-performance.

Figure 6: Effect of the warm-up period and pre-training (p/t) on the performance.
Figure 7: Performance impact of pre-training (p/t) and cross-dataset pre-training with the IEMOCAP (IEM) and SAVEE (SAV) datasets.

4.1.1 Impact of the ϵ\epsilon value

All the compared policies, i.e., the Zeta policy, the Epsilon-Greedy policy, the Max-Boltzmann Q policy, and the Linear Annealed wrapper on the Epsilon-Greedy policy, use the parameter ϵ in their action selection process. ϵ decides between exploration and exploitation within the policy: 1 − ϵ is the probability of selecting exploitation over exploration. Hence, ϵ ranges between 0 and 1.

Several experiments were carried out over the ϵ values 0.1, 0.05, 0.025, and 0.0125. Figure 4 shows the effect of changing the ϵ value on the standard deviation of the reward and the mean reward for all policies.

The Zeta policy shows a noticeable change in the standard deviation of the reward and the mean reward for ϵ = 0.0125. The reason the Zeta policy performs well for lower ϵ is that, when cs < ζ, the Zeta policy picks actions with the exploration-exploitation strategy, where ϵ is the probability of exploration and a random action is drawn from a uniform distribution during exploration. A lower ϵ means less randomness in the period cs < ζ and thus a higher probability of selecting an action based on the Q-values, which helps the RL algorithm correct false predictions.

4.1.2 Impact of the number of steps

The Zeta policy uses the parameter Zeta_nb_step (ζ) to determine the action selection route, and the Linear Annealed wrapper uses the parameter nb_step to determine the gradient of the ϵ value used in its Epsilon-Greedy component. Experiments were defined to examine the behaviour of the RL agent when changing these parameters to 500 000, 250 000, 100 000, and 50 000; Figure 5 shows the results. Looking at the standard deviation curve of the Zeta policy, the robustness of the RL agent increases with the number of steps, with the most robust curve observed for a step count of 500 000.

4.2 Pre-training for the reduced warm-up period

The warm-up period is executed in the RL algorithm to collect experiences for the RL agent to sample during experience replay. With pre-training, however, the inferring DNN of the RL agent has already learnt the relevant features, which leads it to produce better Q-values than an RL agent whose DNN is initialised with random parameters. An experiment was executed to identify whether the warm-up period can be reduced after pre-training while keeping the RL agent's performance unchanged; Figure 6 shows the results. Observing both the standard deviation of the reward and the mean reward in Figure 6, pre-training improves the robustness and the quality of the predictions. The time taken to reach the highest performance is reduced since the warm-up period is reduced, which lowers the RL training time as well as the time needed to optimise the RL agent.

The speed-up of the training period achieved by pre-training was calculated by considering the number of steps needed to reach a mean reward of 0.6. The mean number of steps taken to reach a mean reward of 0.6 was 77 126 without pre-training and 48 623 with pre-training (warm-up of 10 000 steps). The speed-up of the training period from pre-training and reducing the warm-up steps was 1.63x.

4.3 Cross-Dataset pre-training

In our envisioned scenario, an agent, although pre-trained with one corpus, is expected to be robust to other corpora and dialects. To examine the behaviour of cross-dataset pre-training in RL, we pre-trained the RL agent for IEMOCAP with the SAVEE pre-training data subset and plotted the reward and the standard deviation of the reward in Figure 7. The graph shows that pre-training always improved the performance of the RL agent and that cross-dataset pre-training did not degrade the performance drastically. This practice can be used in real-world RL applications where little training data is available: the RL agent can be pre-trained with a dataset aligned with the problem and then deployed to act in the real environment.

4.4 Accuracy of the predictions

The accuracy of a machine learning model is a popular attribute used to benchmark it against comparable models. Since RL is a dynamic programming style method, it does not come with an accuracy attribute. However, as this study is focused on a classification problem, we calculated the accuracy of the RL agent from the environment's logs. Equation (8) is used to calculate the accuracy of an episode. The accuracy of the testing phase after an RL execution of 700 000 steps is tabulated in Table III.

accuracy = (No. of correct inferences / No. of utterances) × 100 %   (8)
TABLE III: Testing accuracy values of the two datasets IEMOCAP and SAVEE under each policy after 700 000 steps of RL training.
Policy \ Dataset | IEMOCAP | SAVEE
Zeta | 54.29 ± 2.50 | 68.90 ± 0.61
Max-Boltzmann | 51.92 ± 1.40 | 67.90 ± 0.40
Epsilon-Greedy | 51.45 ± 0.74 | 58.94 ± 2.85
Linear Annealed | 51.72 ± 0.75 | 62.20 ± 6.10


Table III shows that the Zeta policy outperforms the other compared policies on both datasets. These results can also be compared with those of supervised learning methods, even though the two are different machine learning paradigms.

5 Conclusion

This study investigated the feasibility of a novel reinforcement learning policy, the Zeta policy, for speech emotion recognition. Pre-training the RL agent was also studied as a way to reduce the training time and minimise the warm-up period. The results show that the proposed Zeta policy performs better than the existing policies. We also provided an analysis of the relevant parameters ϵ and the number of steps, which shows the operating range of these two parameters. The results further confirm that pre-training can reduce the time taken to reach maximum performance by reducing the warm-up period. We show that the proposed Zeta policy with pre-training is robust to a cross-corpus scenario. Future work should study a cross-language scenario and explore the feasibility of using the novel Zeta policy with other RL algorithms.

References

  • [1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016. [Online]. Available: https://doi.org/10.1038/nature16961
  • [2] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver, “Grandmaster level in StarCraft II using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019. [Online]. Available: https://www.nature.com/articles/s41586-019-1724-z
  • [3] Y. Shen, C. Huang, S. Wang, Y. Tsao, H. Wang, and T. Chi, “Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6750–6754.
  • [4] R. Fakoor, X. He, I. Tashev, and S. Zarar, “Reinforcement Learning To Adapt Speech Enhancement to Instantaneous Input Signal Quality,” arXiv:1711.10791 [cs], 2018. [Online]. Available: http://arxiv.org/abs/1711.10791
  • [5] H. Chung, H. B. Jeon, and J. G. Park, “Semi-supervised Training for Sequence-to-Sequence Speech Recognition Using Reinforcement Learning,” in Proceedings of the International Joint Conference on Neural Networks.   Institute of Electrical and Electronics Engineers Inc., 7 2020.
  • [6] S. P. Singh, M. J. Kearns, D. J. Litman, and M. A. Walker, “Reinforcement learning for spoken dialogue systems.” in Nips, 1999, pp. 956–962.
  • [7] T. Paek, “Reinforcement learning for spoken dialogue systems: Comparing strengths and weaknesses for practical deployment,” in Proc. Dialog-on-Dialog Workshop, Interspeech, 2006.
  • [8] E. Lakomkin, M. A. Zamani, C. Weber, S. Magg, and S. Wermter, “EmoRL: Continuous Acoustic Emotion Classification using Deep Reinforcement Learning,” Proceedings - IEEE International Conference on Robotics and Automation, pp. 4445–4450, 4 2018. [Online]. Available: http://arxiv.org/abs/1804.04053
  • [9] J. Fan, Z. Wang, Y. Xie, and Z. Yang, “A Theoretical Analysis of Deep Q-Learning,” arXiv, 1 2019. [Online]. Available: http://arxiv.org/abs/1901.00137
  • [10] G. V. d. l. Cruz Jr, Y. Du, and M. E. Taylor, “Pre-training Neural Networks with Human Demonstrations for Deep Reinforcement Learning,” in Adaptive Learning Agents (ALA), 2019. [Online]. Available: http://arxiv.org/abs/1709.04083
  • [11] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser et al., “Starcraft ii: A new challenge for reinforcement learning,” arXiv, vol. 2017, no. 1708.04782, 2017.
  • [12] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband et al., “Deep q-learning from demonstrations,” in Proceedings AAAI, 2018.
  • [13] V. Kurin, S. Nowozin, K. Hofmann, L. Beyer, and B. Leibe, “The atari grand challenge dataset,” arXiv, no. 1705.10998, 2017.
  • [14] S. Calinon, “Learning from demonstration (programming by demonstration),” Encyclopedia of Robotics, pp. 1–8, 2018.
  • [15] D. Yu, L. Deng, and G. Dahl, “Roles of pre-training and fine-tuning in context-dependent dbn-hmms for real-world speech recognition,” in Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
  • [16] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, “Deep neural network features and semi-supervised training for low resource speech recognition,” in 2013 IEEE international conference on acoustics, speech and signal processing.   IEEE, 2013, pp. 6704–6708.
  • [17] Y. Liu and K. Kirchhoff, “Graph-based semi-supervised acoustic modeling in dnn-based speech recognition,” in 2014 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2014, pp. 177–182.
  • [18] J. Stockholm and P. Pasquier, “Reinforcement Learning of Listener Response for Mood Classification of Audio,” in 2009 International Conference on Computational Science and Engineering, vol. 4, 2009, pp. 849–853.
  • [19] H. Yu and P. Yang, “An Emotion-Based Approach to Reinforcement Learning Reward Design,” in 2019 IEEE 16th International Conference on Networking, Sensing and Control (ICNSC), 2019, pp. 346–351.
  • [20] M. G. Lagoudakis and R. Parr, “Reinforcement learning as classification: leveraging modern classifiers,” ser. ICML’03.   AAAI Press, 2003, pp. 424–431.
  • [21] H. Han, K. Byun, and H. G. Kang, “A deep learning-based stress detection algorithm with speech signal,” in AVSU 2018 - Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, Co-located with MM 2018.   New York, NY, USA: Association for Computing Machinery, Inc, 10 2018, pp. 11–15. [Online]. Available: https://dl.acm.org/doi/10.1145/3264869.3264875
  • [22] S. Latif, R. Rana, and J. Qadir, “Adversarial Machine Learning And Speech Emotion Recognition: Utilizing Generative Adversarial Networks For Robustness,” arXiv, 11 2018. [Online]. Available: http://arxiv.org/abs/1811.11402
  • [23] R. Rana, S. Latif, R. Gururajan, A. Gray, G. Mackenzie, G. Humphris, and J. Dunn, “Automated screening for distress: A perspective for the future,” European Journal of Cancer Care, vol. 28, no. 4, 7 2019. [Online]. Available: https://pubmed.ncbi.nlm.nih.gov/30883964/
  • [24] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo, and A. Gruslys, “Deep Q-learning from Demonstrations,” arXiv:1704.03732 [cs], 2017. [Online]. Available: http://arxiv.org/abs/1704.03732
  • [25] C. J. C. H. Watkins, “Learning from Delayed Rewards,” Ph.D. dissertation, Cambridge, UK, 1989.
  • [26] M. Wiering, “Explorations in Efficient Reinforcement Learning,” Ph.D. dissertation, 1999. [Online]. Available: https://dare.uva.nl/search?identifier=6ac07651-85ee-4c7b-9cab-86eea5b818f4
  • [27] N. Cesa-Bianchi, C. Gentile, G. Lugosi, and G. Neu, “Boltzmann exploration done right,” ser. NIPS’17.   Curran Associates Inc., 2017, pp. 6287–6296.
  • [28] L. Pan, Q. Cai, Q. Meng, W. Chen, and L. Huang, “Reinforcement learning with dynamic boltzmann softmax updates,” in IJCAI International Joint Conference on Artificial Intelligence, vol. 2021-January.   International Joint Conferences on Artificial Intelligence, 7 2020, pp. 1992–1998. [Online]. Available: https://www.ijcai.org/proceedings/2020/276
  • [29] F. Leibfried and P. Vrancx, “Model-Based Regularization for Deep Reinforcement Learning with Transcoder Networks,” arXiv, 9 2018. [Online]. Available: http://arxiv.org/abs/1809.01906
  • [30] M. El Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011.
  • [31] D. Ververidis and C. Kotropoulos, “Emotional speech recognition: Resources, features, and methods,” Speech Communication, vol. 48, no. 9, pp. 1162–1181, 2006.
  • [32] B. Schuller, G. Rigoll, and M. Lang, “Hidden Markov model-based speech emotion recognition,” in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03)., vol. 2, 2003, pp. II–1.
  • [33] D. A. Cairns and J. H. L. Hansen, “Nonlinear analysis and classification of speech under stressed conditions,” The Journal of the Acoustical Society of America, vol. 96, no. 6, pp. 3392–3400, 1994. [Online]. Available: https://asa.scitation.org/doi/10.1121/1.410601
  • [34] C. Lee, S. S. Narayanan, and R. Pieraccini, “Combining acoustic and language information for emotion recognition,” 2002. [Online]. Available: https://www.semanticscholar.org/paper/Combining-acoustic-and-language-information-for-Lee-Narayanan/d5abf8adb874577dffc4038207b6b91bee0a3450
  • [35] S. Latif, R. Rana, S. Khalifa, R. Jurdak, and J. Epps, “Direct Modelling of Speech Emotion from Raw Speech,” 2019. [Online]. Available: http://arxiv.org/abs/1904.03833
  • [36] K. Han, D. Yu, and I. Tashev, “Speech Emotion Recognition Using Deep Neural Network and Extreme Learning Machine,” in Interspeech 2014, 9 2014.
  • [37] S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, “Cross Corpus Speech Emotion Classification - An Effective Transfer Learning Technique,” 2018.
  • [38] K. Y. Huang, C. H. Wu, Q. B. Hong, M. H. Su, and Y. H. Chen, “Speech Emotion Recognition Using Deep Neural Network Considering Verbal and Nonverbal Speech Sounds,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2019-May.   Institute of Electrical and Electronics Engineers Inc., 5 2019, pp. 5866–5870.
  • [39] R. Polikar, L. Upda, S. S. Upda, and V. Honavar, “Learn++: an incremental learning algorithm for supervised neural networks,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 31, no. 4, pp. 497–508, 2001.
  • [40] Q. L. Zhao, Y. H. Jiang, and M. Xu, “Incremental learning by heterogeneous bagging ensemble,” in International Conference on Advanced Data Mining and Applications.   Springer, 2010, pp. 1–12.
  • [41] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008. [Online]. Available: https://doi.org/10.1007/s10579-008-9076-6
  • [42] S. Haq, P. J. B. Jackson, and J. Edge, “Audio-visual feature selection and reduction for emotion classification,” 2008.
  • [43] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE transactions on acoustics, speech, and signal processing, vol. 28, no. 4, pp. 357–366, 1980.
  • [44] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python,” vol. 8, 2015.
  • [45] F. Chollet and others, Keras, 2015. [Online]. Available: https://keras.io
  • [46] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. [Online]. Available: https://www.tensorflow.org/
  • [47] S. Latif, R. Rana, S. Khalifa, R. Jurdak, J. Qadir, and B. W. Schuller, “Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends,” arXiv:2001.00378 [cs, eess], 2020. [Online]. Available: http://arxiv.org/abs/2001.00378
Thejan Rajapakshe received a bachelor's degree in Applied Sciences from Rajarata University of Sri Lanka, Mihintale, and a Bachelor of Information Technology from the University of Colombo School of Computing in 2016. He is currently a research scholar at the University of Southern Queensland (USQ). He is also an Associate Technical Team Lead at CodeGen International - Research & Development. His research interests include reinforcement learning, speech processing and deep learning.
Rajib Rana is an experimental computer scientist, Advance Queensland Research Fellow and a Senior Lecturer in the University of Southern Queensland. He is also the Director of IoT Health research program at the University of Southern Queensland. He is the recipient of the prestigious Young Tall Poppy QLD Award 2018 as one of Queensland’s most outstanding scientists for achievements in the area of scientific research and communication. Rana’s research work aims to capitalise on advancements in technology along with sophisticated information and data processing to better understand disease progression in chronic health conditions and develop predictive algorithms for chronic diseases, such as mental illness and cancer. His current research focus is on Unsupervised Representation Learning. He received his B.Sc. degree in Computer Science and Engineering from Khulna University, Bangladesh with Prime Minister and President’s Gold medal for outstanding achievements and Ph.D. in Computer Science and Engineering from the University of New South Wales, Sydney, Australia in 2011. He received his postdoctoral training at Autonomous Systems Laboratory, CSIRO before joining the University of Southern Queensland as Faculty in 2015.
Sara Khalifa is currently a senior research scientist at the Distributed Sensing Systems research group, Data61—CSIRO. She is also an honorary adjunct lecturer at University of Queensland and conjoint lecturer at University of New South Wales. Her research interests revolve around the broad aspects of mobile and ubiquitous computing, mobile sensing and Internet of Things (IoT). She obtained a PhD in Computer Science and Engineering from UNSW (Sydney, Australia). Her PhD dissertation received the 2017 John Makepeace Bennett Award which is awarded by CORE (Computing Research and Education Association of Australasia) to the best PhD dissertation of the year within Australia and New Zealand in the field of Computer Science. Her research has been recognised by multiple iAwards including 2017 NSW Mobility Innovation of the year, 2017 NSW R&D Innovation of the year, National Merit R&D Innovation of the year, and the Merit R&D award at the Asia Pacific ICT Alliance (APICTA) Awards, commonly known as the "Oscar" of the ICT industry in the Asia Pacific, among others.
Björn W. Schuller received his diploma in 1999, his doctoral degree for his study on Automatic Speech and Emotion Recognition in 2006, and his habilitation and Adjunct Teaching Professorship in the subject area of Signal Processing and Machine Intelligence in 2012, all in electrical engineering and information technology from TUM in Munich/Germany. He is Professor of Artificial Intelligence in the Department of Computing at the Imperial College London/UK, where he heads GLAM — the Group on Language, Audio & Music, Full Professor and head of the ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg/Germany, and CEO of audEERING. He was previously full professor and head of the Chair of Complex and Intelligent Systems at the University of Passau/Germany. Professor Schuller is Fellow of the IEEE, Golden Core Member of the IEEE Computer Society, Senior Member of the ACM, President-emeritus of the Association for the Advancement of Affective Computing (AAAC), and was elected member of the IEEE Speech and Language Processing Technical Committee. He (co-)authored 5 books and more than 800 publications in peer-reviewed books, journals, and conference proceedings leading to more than overall 25 000 citations (h-index = 73). Schuller is general chair of ACII 2019, co-Program Chair of Interspeech 2019 and ICMI 2019, repeated Area Chair of ICASSP, and former Editor in Chief of the IEEE Transactions on Affective Computing next to a multitude of further Associate and Guest Editor roles and functions in Technical and Organisational Committees.
Jiajun Liu is a science leader at CSIRO, Australia. His current research interests include multimedia content analysis, indexing, and retrieval. He received the B.E. degree from Nanjing University, China, and the Ph.D. degree from the University of Queensland, Brisbane, QLD, Australia, in 2013. He was also a Researcher/Software Engineer for IBM, China, from 2006 to 2008.