Power pooling: An adaptive pooling function for weakly labelled sound event detection
Abstract
Access to large corpora with strongly labelled sound events is expensive and difficult in engineering applications. Much research has therefore turned to the problem of detecting both the types and the timestamps of sound events from weak labels that only specify the types. This task can be treated as a multiple instance learning (MIL) problem, and the key to it is the design of a pooling function. The state-of-the-art linear softmax pooling function, however, cannot flexibly deal with sound sources of different time scales. In this paper, we propose an adaptive power pooling function which can automatically adapt to various sound sources. On two public datasets, the proposed power pooling function outperforms linear softmax pooling on both coarse-grained and fine-grained metrics. Notably, it improves the event-based F1 score by 11.4% and 10.2% relative on the two datasets. While this paper focuses on sound event detection, the proposed method can be applied to MIL tasks in other domains.
1 Introduction
Sound event detection (SED) aims to identify the categories and timestamps of target sound events in continuous audio recordings. Some studies only focus on the categories of sound events present (audio tagging), while this paper pays more attention to the detection of onsets and offsets of sound events (localization). Traditional SED models are often trained on data with strong labels, which contain the category and timestamps of each sound event occurrence [1, 2, 3, 4]. However, in real-world applications such as noise monitoring, surveillance systems, machine condition monitoring, and multimedia indexing, acquiring such strong labels can incur a high cost. In 2017, Google released a large-scale weakly labelled dataset (AudioSet) [5] with annotations of only event categories at a coarse time resolution (10 seconds). AudioSet has led researchers to pay more attention to weakly labelled SED, i.e. SED when no fine-grained timestamps are available.
Weakly labelled SED can be addressed as a multiple instance learning (MIL) problem. An audio clip and the frames in it can be regarded as a bag and the instances in the bag. A positive bag is a clip that contains a certain event; it consists of at least one positive frame and may also contain negative frames. A negative bag, on the other hand, consists only of negative frames. To improve localization accuracy, considerable research effort has gone into selecting positive instances more precisely. For weakly supervised object detection, Wan et al. [6] introduced a continuation optimization algorithm. Yang et al. [7] proposed to jointly train an MIL branch and a bounding-box regression branch. These methods activate more positive samples by avoiding local minima. Cheng et al. [8] focused on generating and selecting high-quality proposals to better envelop all positive samples.
As for SED, many studies are devoted to the design of pooling functions. A MIL system predicts a probability for each frame, and aggregates the frame-level predictions into a clip-level probability via a pooling function. As shown in Fig. 1, the pooling function calculates the clip-level probability as a weighted average of the frame-level probabilities, and also serves to back-propagate gradients from the clip-level loss function to the frames. Ideally, the pooling function should be discriminative enough to produce positive gradients for positive frames and negative gradients for negative frames. “Max” pooling assigns zero weights to non-maximizing frames; this produces zero gradients and makes optimization difficult. Average pooling [9] weights all frames equally and produces positive gradients for all frames, which is not discriminative enough for positive bags containing negative instances. Linear softmax pooling [10], exponential softmax pooling [11], and attention pooling [12] assign a different weight to each frame, thereby varying the sign and magnitude of their gradients. Wang et al. [13] compared the above five pooling functions, and demonstrated that linear softmax pooling was the best at localizing sound events because it could produce positive gradients for some frames and negative gradients for others. McFee et al. [14] developed a family of adaptive pooling operators named auto-pool which can achieve a similar effect. He et al. [15] proposed a hierarchical pooling structure.
It remains challenging, however, to predict the onsets and offsets of sound events using these pooling functions. A core reason is that environmental sounds in general have less structure than speech and music. Many independent sources (e.g. animals, vehicles, electrical appliances) exhibit characteristics with considerable variability. One of the key factors behind this variability is the varying duration of different events, and the above pooling functions cannot adapt to it. Besides, since synthetic strongly labelled data and unlabelled data are easier to obtain in real life, studies and challenges such as DCASE 2019 Task 4 [16] have turned to semi-supervised SED with synthetic strongly labelled, weakly labelled and unlabelled data. Nevertheless, to our knowledge, few studies have investigated the effects of applying different pooling functions in semi-supervised SED.
To mitigate the above issues, we design a simple but effective pooling function termed power pooling. We use a power function of the predicted frame-level probability as the weight for each frame, and set the exponent as a trainable parameter to automatically learn an optimal threshold between positive and negative gradients. Power pooling thus provides variable gradient directions for frame-level predictions via a flexible threshold, improving adaptivity. The trainable power parameter can also be made dependent on the event category, thus adapting to sources with different types of acoustic characteristics. We evaluate the proposed method on a purely weakly labelled dataset (DCASE 2017 Task 4) and a semi-supervised dataset (DCASE 2019 Task 4). Our empirical results show that power pooling outperforms other pooling functions on all metrics.

2 Pooling functions
Polyphonic SED with $C$ event categories can be regarded as $C$ binary MIL problems. As such, we will only consider one event category hereafter. Each training clip can be regarded as a bag $B = \{x_1, x_2, \dots, x_T\}$ with a label $y \in \{0, 1\}$, where the $x_i$ are the instances (frames). As illustrated in Fig. 1, an SED system predicts a frame-level probability $p_i$ for each instance $x_i$. The pooling function aggregates the $p_i$ into the clip-level prediction $P$ by assigning a weight $w_i$ to each frame and taking the weighted average:
$$P = \frac{\sum_i w_i\, p_i}{\sum_i w_i} \qquad (1)$$
The loss function is chosen to be the cross-entropy between the clip-level prediction $P$ and the label $y$. During back-propagation, the pooling function determines the gradient received by each instance, and these gradients should have appropriate signs. The formulas of the gradients for five types of pooling functions can be found in [13].
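To make the aggregation of Eq. (1) concrete, the following NumPy sketch (our own illustration, not code from any of the cited systems) computes the clip-level probability for several of the weighting schemes discussed above; attention pooling is omitted because its weights come from a separately learned branch.

```python
# Minimal sketch of Eq. (1): the clip-level probability P is a weighted average
# of the frame-level probabilities p_i, and pooling functions differ only in
# how they assign the weights w_i.
import numpy as np

def pool(p, weights):
    """Aggregate frame-level probabilities p with the given weights (Eq. 1)."""
    return np.sum(weights * p) / np.sum(weights)

def clip_probability(p, mode="linear_softmax"):
    if mode == "max":                # only the maximizing frame(s) count
        w = (p == p.max()).astype(float)
    elif mode == "average":          # all frames weighted equally
        w = np.ones_like(p)
    elif mode == "linear_softmax":   # w_i = p_i
        w = p
    elif mode == "exp_softmax":      # w_i = exp(p_i)
        w = np.exp(p)
    else:
        raise ValueError(mode)
    return pool(p, w)

p = np.array([0.05, 0.1, 0.9, 0.8, 0.2])   # toy frame-level predictions
for mode in ["max", "average", "linear_softmax", "exp_softmax"]:
    print(mode, round(float(clip_probability(p, mode)), 3))
```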
A pooling function should generally assign larger weights to frames with larger predicted probabilities. This is in order to conform to the standard multiple instance (SMI) assumption: the bag label is positive if and only if the bag contains at least one positive instance. The state-of-the-art linear softmax pooling uses $p_i$ itself as the weight $w_i$:
$$P = \frac{\sum_i p_i^2}{\sum_i p_i} \qquad (2)$$
and its gradient is
$$\frac{\partial P}{\partial p_i} = \frac{2\,p_i - P}{\sum_j p_j} \qquad (3)$$
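The sign behaviour of Eq. (3) can be checked numerically. The short PyTorch snippet below (an illustrative check, not the paper's training code) compares the autograd gradient of Eq. (2) with the analytic form and shows that it is positive exactly for frames with $p_i > P/2$.

```python
# Verify that dP/dp_i = (2*p_i - P) / sum_j p_j for linear softmax pooling,
# and that the gradient changes sign at p_i = P/2.
import torch

p = torch.tensor([0.05, 0.2, 0.45, 0.9], requires_grad=True)
P = torch.sum(p * p) / torch.sum(p)           # Eq. (2): w_i = p_i
P.backward()

analytic = (2 * p.detach() - P.detach()) / torch.sum(p.detach())
print("P =", float(P))                        # clip-level probability
print("autograd :", p.grad)
print("analytic :", analytic)
print("positive gradient iff p_i > P/2:",
      torch.equal(p.grad > 0, p.detach() > P.detach() / 2))
```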
Table 2: The directions of the clip-level and frame-level gradients in linear softmax pooling ($r = 1/2$) and power pooling ($r = n/(n+1)$).

| Label | Clip-level | Condition | Frame-level |
|---|---|---|---|
| positive ($y = 1$) | $P \uparrow$ | $p_i > rP$ | $p_i \uparrow$ |
| | | $p_i < rP$ | $p_i \downarrow$ |
| negative ($y = 0$) | $P \downarrow$ | $p_i > rP$ | $p_i \downarrow$ |
| | | $p_i < rP$ | $p_i \uparrow$ |
This gradient is positive if and only if $p_i > P/2$. For positive clips ($y = 1$), this causes “larger” frame-level probabilities to increase and “smaller” frame-level probabilities to decrease, thereby yielding clear boundaries of event occurrences. The threshold between “larger” and “smaller” probabilities is given by $P/2$. For negative clips ($y = 0$), the gradients push all frame-level probabilities toward $P/2$. Because this threshold is smaller than $P$, the clip-level probability $P$ keeps decreasing and drags the threshold down with it, so all the frame-level probabilities converge to 0 as desired after enough iterations. The movements of the frame-level probabilities are listed in Table 2 as well as depicted in Fig. 1.
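A toy gradient-descent simulation makes these dynamics visible. The sketch below is our own illustration (the frame logits, learning rate and number of steps are arbitrary): it minimizes a cross-entropy loss on the clip-level prediction under linear softmax pooling, so that large frame probabilities rise and small ones fall for a positive clip, while all of them decay toward 0 for a negative clip.

```python
# Toy simulation of the dynamics summarized in Table 2 for linear softmax pooling.
import torch

def simulate(label, steps=2000, lr=0.5):
    logits = torch.tensor([-2.0, -1.0, 0.5, 1.5], requires_grad=True)  # frame logits
    for _ in range(steps):
        p = torch.sigmoid(logits)                      # frame-level probabilities
        P = torch.sum(p * p) / torch.sum(p)            # linear softmax pooling
        loss = torch.nn.functional.binary_cross_entropy(P, torch.tensor(float(label)))
        loss.backward()
        with torch.no_grad():
            logits -= lr * logits.grad
            logits.grad.zero_()
    return torch.sigmoid(logits).detach()

print("positive clip:", simulate(1))   # high frames rise toward 1, low frames fall
print("negative clip:", simulate(0))   # all frame probabilities decay toward 0
```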
Define $r$ as the ratio between the threshold at which the gradient changes sign and the clip-level predicted probability $P$. In linear softmax pooling, $r$ is fixed at $1/2$. In reality, however, it may be desirable to have a different $r$ for different event categories. For example, we may want to boost the predicted probabilities of more frames when a clip contains a type of event that usually lasts a long time (e.g. vacuum cleaner), and boost fewer frames when the type of event in question is usually transient (e.g. dog bark). This motivated us to propose power pooling.
3 Power pooling
Without changing the pattern of how the predictions move in Table 2, we hope to make the threshold variable by adding a small number of trainable parameters on top of the linear softmax pooling function. We use a trainable parameter $n$ as the exponent of the frame-level probabilities $p_i$, and the pooling function can be written as:
$$P = \frac{\sum_i p_i^{\,n+1}}{\sum_i p_i^{\,n}} \qquad (4)$$
and its gradient can be written as:
$$\frac{\partial P}{\partial p_i} = \frac{p_i^{\,n-1}\left[(n+1)\,p_i - n P\right]}{\sum_j p_j^{\,n}} \qquad (5)$$
To conform to the SMI assumption, the weight $w_i = p_i^{\,n}$ must be an increasing function of $p_i$, therefore $n$ must be non-negative. The gradient will still have different signs for different frames; the threshold is given by $rP$, where $r = n / (n+1)$. During back-propagation, frame-level probabilities will move in the same pattern as in Table 2. The pooling takes the $n$-th power of the frame-level probability as its weight, so we refer to it as power pooling.
Power pooling inherits the advantage of linear softmax pooling at localizing sound events. In addition, the learnable power parameter allows it to approach either max pooling (as $n \to \infty$) or average pooling (as $n \to 0$). When $n$ is fixed to 1, power pooling reduces to linear softmax pooling. It is desirable to make the power depend on the event category: for long-lasting events we prefer a smaller $n_c$, which results in a lower threshold and boosts more frames; for transient events we would do the opposite. When the power gets too large, however, power pooling can suffer from the same problem of zero gradients as max pooling. To avoid this, we add a regularization term that penalizes large values of $n_c$ to the loss function, where $n_c$ is the power parameter for event category $c$.
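A possible PyTorch realisation of power pooling is sketched below. The softplus parametrization used to keep each per-category exponent $n_c$ non-negative, the squared penalty used as the regularizer, and all tensor shapes are our own assumptions for illustration; the text above does not fix these implementation details.

```python
# Hedged sketch of power pooling: one trainable exponent n_c per event category,
# clip-level probability P_c = sum_t p_{t,c}^{n_c+1} / sum_t p_{t,c}^{n_c}  (Eq. 4).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerPooling(nn.Module):
    def __init__(self, num_classes, init_power=1.0, eps=1e-7):
        super().__init__()
        # raw parameter; softplus keeps the effective exponent n_c >= 0
        init_raw = torch.log(torch.expm1(torch.tensor(init_power)))
        self.raw_power = nn.Parameter(torch.full((num_classes,), float(init_raw)))
        self.eps = eps

    @property
    def power(self):
        return F.softplus(self.raw_power)                   # n_c >= 0

    def forward(self, frame_probs):
        # frame_probs: (batch, time, classes), values in (0, 1)
        p = frame_probs.clamp(self.eps, 1.0 - self.eps)
        w = p ** self.power                                 # weights w_i = p_i^n
        return (w * p).sum(dim=1) / w.sum(dim=1)            # Eq. (4)

    def regularizer(self):
        # assumed penalty discouraging very large exponents (max-pooling regime)
        return (self.power ** 2).sum()

# usage sketch: frame-level sigmoid outputs -> clip-level probabilities
pool = PowerPooling(num_classes=10)
frame_probs = torch.rand(4, 400, 10)                        # e.g. CRNN sigmoid outputs
clip_probs = pool(frame_probs)
clip_labels = torch.randint(0, 2, (4, 10)).float()
loss = F.binary_cross_entropy(clip_probs, clip_labels) + 1e-3 * pool.regularizer()
```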
4 Experiments and discussion
We carry out experiments to compare the performance of power pooling with other pooling functions on a purely weakly labelled dataset (DCASE 2017 Task 4 [17]) and a semi-supervised SED dataset (DCASE 2019 Task 4 [16]). The realistic recordings of both datasets are subsets of AudioSet [5], and DCASE 2019 additionally has a subset of synthetic recordings. Most clips have a duration of 10 seconds (a few clips are shorter), and multiple audio events may occur at the same time.
The dataset of Task 4 of the DCASE 2017 challenge is composed of 17 types of “warning” and “vehicle” sounds. We take the weakly labelled training set (51,172 clips) and the strongly labelled public test set (488 clips). The DCASE 2019 dataset focuses on 10 types of sound events in domestic environments. It consists of three training sets (synthetic strongly labelled: 2,045 clips, weakly labelled: 1,578 clips, unlabelled: 14,412 clips) and a validation set (1,168 clips). The mean durations of events in DCASE 2017 lie in the range of 4–10 s, covering more than 40% of a clip's duration; the mean durations in DCASE 2019 lie in the range of 0.5–5 s, covering less than 50% of a clip's duration. These two datasets thus contain relatively long and short events, respectively.
The performance of the systems is measured with fine-grained and coarse-grained (clip-level) $F_1$ scores, which balance precision and recall. For fine-grained evaluation, we adopt both event-based metrics, with a 200 ms collar on onsets and a collar of 200 ms or 20% of the event length on offsets, and segment-based metrics with the segment duration set to 1 s. Besides, we aggregate the metrics across event categories using the macro-average. The evaluation details can be found in [18].
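For reference, the snippet below gives a simplified, self-contained version of the event-based matching rule (onset collar of 200 ms, offset collar of 200 ms or 20% of the event length, whichever is larger). The scores reported in this paper are computed with the standard evaluation described in [18]; this greedy matcher only illustrates the criterion and is not the official implementation.

```python
# Simplified event-based F1 for a single class; times are in seconds.
def event_f1(reference, estimated, onset_collar=0.2, offset_pct=0.2):
    """reference/estimated: lists of (onset, offset) tuples for one event class."""
    matched = set()
    tp = 0
    for est_on, est_off in estimated:
        for idx, (ref_on, ref_off) in enumerate(reference):
            if idx in matched:
                continue
            offset_collar = max(onset_collar, offset_pct * (ref_off - ref_on))
            if abs(est_on - ref_on) <= onset_collar and abs(est_off - ref_off) <= offset_collar:
                matched.add(idx)
                tp += 1
                break
    fp = len(estimated) - tp
    fn = len(reference) - tp
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0

ref = [(1.0, 4.0), (6.0, 6.5)]
est = [(1.1, 3.6), (6.3, 6.6)]   # first event matches; second misses the onset collar
print(event_f1(ref, est))        # 2*1 / (2*1 + 1 + 1) = 0.5
```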
The data preprocessing and model architecture on DCASE 2017 are nearly the same as in [13]. In a nutshell, the input filterbank features have 400 frames and 64 frequency bins, and the model consists of 3 convolutional blocks, 2 BiGRU layers and 1 dense layer; we add a batch normalization layer to each convolutional block. Since DCASE 2019 contains unlabelled data, a semi-supervised framework is applied. The data preprocessing and backbone model are based on [19]: we adopt the popular mean-teacher architecture [20] and a convolutional recurrent neural network feature extractor [4]. Furthermore, the following optimizations are performed. First, we augment the data by shifting the input features along the time axis, sampling the shift from a normal distribution with zero mean and a standard deviation of 16 frames. Second, we adopt the architecture of the feature extractor in [21]. Third, we apply a set of median filters to the frame-level predicted probabilities, using window sizes proportional to the average duration of each event category. Finally, the regularization hyperparameter for the power parameters is set to a non-zero value for DCASE 2017 and to 0 for DCASE 2019.
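The two lightweight processing steps just described, the random time shift of the input features and the class-wise median filtering of the frame-level probabilities, can be sketched as follows. The function names, the use of np.roll for shifting, and the proportionality constant for the filter windows are illustrative assumptions rather than the exact implementation.

```python
# Sketch of the time-shift augmentation and class-wise median-filter post-processing.
import numpy as np
from scipy.ndimage import median_filter

def random_time_shift(features, std_frames=16, rng=None):
    """features: (time, mel_bins); circularly shift along time by N(0, std_frames) frames."""
    rng = rng or np.random.default_rng()
    shift = int(round(rng.normal(0.0, std_frames)))
    return np.roll(features, shift, axis=0)

def smooth_predictions(frame_probs, win_sizes):
    """frame_probs: (time, classes); win_sizes: per-class median-filter lengths in frames."""
    smoothed = np.empty_like(frame_probs)
    for c, win in enumerate(win_sizes):
        smoothed[:, c] = median_filter(frame_probs[:, c], size=max(1, int(win)))
    return smoothed

feats = np.random.rand(400, 64)                 # 400 frames x 64 filterbank bins
shifted = random_time_shift(feats)
probs = np.random.rand(400, 10)                 # frame-level probabilities for 10 classes
avg_duration_frames = np.array([20, 35, 50, 80, 120, 150, 40, 60, 90, 200])  # hypothetical
smoothed = smooth_predictions(probs, win_sizes=avg_duration_frames // 3)     # proportional windows
```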
Table 3: Detailed results on DCASE 2017.

| Pooling function | Event F1 | Event Prec. | Event Rec. | Segment F1 | Segment Prec. | Segment Rec. | Clip F1 |
|---|---|---|---|---|---|---|---|
| Max | 0.094 | 0.169 | 0.073 | 0.372 | **0.567** | 0.300 | 0.465 |
| Average | 0.165 | 0.147 | 0.196 | 0.450 | 0.425 | 0.515 | 0.516 |
| Exponential | 0.166 | 0.154 | 0.187 | 0.466 | 0.472 | 0.481 | 0.521 |
| Attention [12] | 0.130 | 0.139 | 0.171 | 0.434 | 0.481 | 0.470 | 0.527 |
| Auto [14] | 0.169 | 0.144 | 0.217 | 0.457 | 0.425 | 0.535 | 0.536 |
| CAP [14] | 0.164 | 0.152 | 0.199 | 0.468 | 0.447 | 0.521 | 0.544 |
| RAP [14] | 0.176 | 0.147 | 0.233 | 0.464 | 0.411 | **0.561** | 0.532 |
| RAP [14] | 0.165 | 0.145 | 0.202 | 0.464 | 0.432 | 0.529 | 0.534 |
| RAP [14] | 0.158 | 0.132 | 0.206 | 0.455 | 0.410 | 0.539 | 0.526 |
| Linear [13] | 0.162 | **0.178** | 0.161 | 0.471 | 0.542 | 0.451 | 0.535 |
| Power | **0.196** | 0.168 | **0.248** | **0.480** | 0.460 | 0.537 | **0.545** |
Table 4: Detailed results on DCASE 2019.

| Pooling function | Event F1 | Event Prec. | Event Rec. | Segment F1 | Segment Prec. | Segment Rec. | Clip F1 |
|---|---|---|---|---|---|---|---|
| Max | 0.256 | 0.381 | 0.201 | 0.488 | **0.836** | 0.362 | 0.609 |
| Average | 0.171 | 0.158 | 0.201 | 0.564 | 0.499 | 0.675 | 0.597 |
| Exponential | 0.187 | 0.190 | 0.203 | 0.569 | 0.559 | 0.654 | 0.611 |
| Attention [12] | 0.320 | 0.359 | 0.300 | 0.600 | 0.688 | 0.547 | 0.386 |
| Auto [14] | 0.218 | 0.288 | 0.180 | 0.597 | 0.755 | 0.526 | 0.655 |
| CAP [14] | 0.188 | 0.217 | 0.171 | 0.598 | 0.628 | 0.601 | 0.641 |
| RAP [14] | 0.177 | 0.189 | 0.179 | 0.584 | 0.544 | 0.668 | 0.639 |
| RAP [14] | 0.172 | 0.175 | 0.181 | 0.586 | 0.530 | **0.682** | 0.640 |
| RAP [14] | 0.178 | 0.229 | 0.151 | 0.537 | 0.602 | 0.533 | 0.516 |
| Linear [13] | 0.343 | 0.431 | 0.292 | 0.583 | 0.738 | 0.498 | 0.655 |
| Power | **0.378** | **0.437** | **0.340** | **0.624** | 0.752 | 0.547 | **0.694** |
Table 3 and Table 4 compare power pooling with the three classic pooling functions (max, average and exponential softmax), the popular attention pooling [12], the family of auto-pool operators (Auto, CAP, RAP) [14], and the baseline linear softmax pooling [13] on the DCASE 2017 and DCASE 2019 datasets. The best score in each column is shown in bold. The impact of the pooling functions appears similar on the weakly labelled and the semi-supervised dataset. Power pooling achieves the highest $F_1$ at all three levels on both datasets. The fact that power pooling benefits both fine-grained and coarse-grained SED indicates that it yields proper weights and gradients. For event-level SED, which this article focuses on, power pooling improves on the strongest competing pooling function by 2.0% and 3.5% absolute (11.4% and 10.2% relative) on the two datasets. The baseline linear softmax pooling produces significantly worse recall than power pooling while achieving only comparable precision, which demonstrates that power pooling finds better power parameters and reduces false negatives.

Fig. 2 illustrates the predictions for the “vacuum cleaner” event on a weakly labelled training clip, produced by a system after 40 epochs of training (we trained for 200 epochs in total). We also show the actual temporal interval spanned by the event. For the power pooling system, we show the thresholds between positive and negative gradients arising from both the power pooling function ($rP$ with $r = n/(n+1)$) and the linear softmax pooling function ($P/2$). The power parameter learned for the “vacuum cleaner” event is smaller than 1; therefore power pooling yields a lower threshold, allowing more frames to receive positive gradients. Compared with the linear softmax pooling system, more frames in the power pooling system receive a gradient in the correct direction, notably from 0.5 s to 3 s. This illustrates how a more appropriate threshold between positive and negative gradients helps to correctly pinpoint the onsets and offsets of events.

Fig. 3 shows the power parameter of each event category in the final model. We sort the event categories horizontally by their average duration, and divide them roughly into shorter events (green and blue symbols) and longer events (red and yellow symbols). The durations of shorter and longer events in the DCASE 2017 dataset fall within 4–5 s and 6–10 s, respectively; for DCASE 2019, they fall within 0.5–1.1 s and 2–5 s. Except for a small number of categories (represented by triangles), longer events tend to have smaller power parameters, pushing power pooling toward average pooling, while shorter events tend to have larger power parameters, pushing power pooling toward max pooling. This is observed both within each dataset and across the two datasets, and agrees well with our motivation.
5 Conclusion
This paper has proposed a practical power pooling function for weakly labelled SED. Power pooling overcomes the shortcoming of the state-of-the-art linear softmax pooling, in which the weight of a frame is tied to its predicted probability by a fixed formula. With only one learnable power parameter added per event category, power pooling can automatically learn an appropriate threshold between positive and negative gradients for each category, allowing it to adapt to sound events of different time scales. Experiments show that power pooling achieves the highest $F_1$ at all levels on both weakly labelled SED and semi-supervised SED. Moreover, the power pooling function is generic enough to be applied to MIL problems in other domains.
This work is supported by the National Natural Science Foundation of China (No. 62071461).
Yuzhuo Liu, Hanting Chen and Pengyuan Zhang (Key Laboratory of Speech Acoustics & Content Understanding, Institute of Acoustics, University of Chinese Academy of Sciences, China); Yun Wang (Facebook AI Applied Research)
References
- [1] E. Cakir, T. Heittola, H. Huttunen, et al., “Polyphonic sound event detection using multi label deep neural networks”, Proceedings of the International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, July 2015, pp. 1–7.
- [2] E. Cakir, E. C. Ozan, T. Virtanen, et al., “Filterbank learning for deep neural network based polyphonic sound event detection”, IJCNN, Vancouver, July 2016, pp. 3399–3406.
- [3] G. Parascandolo, H. Huttunen, T. Virtanen, et al., “Recurrent neural networks for polyphonic sound event detection in real life recordings”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, March 2016, pp. 6440–6444.
- [4] E. Cakir and T. Virtanen, “End-to-end polyphonic sound event detection using convolutional recurrent neural networks with learned time-frequency representation input”, IJCNN, Rio de Janeiro, Brazil, July 2018, pp. 2412–2418.
- [5] J. F. Gemmeke, et al., “Audio Set: An ontology and human-labelled dataset for audio events”, ICASSP, New Orleans, LA, 2017, pp. 776–780, doi: 10.1109/ICASSP.2017.7952261.
- [6] F. Wan, C. Liu, W. Ke, X. Ji, J. Jiao, and Q. Ye, "C-MIL: Continuation multiple instance learning for weakly supervised object detection", IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2199-2208.
- [7] K. Yang, D. Li, and Y. Dou, "Towards precise End-to-End weakly supervised object detection network", IEEE/CVF Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8372-8381.
- [8] G. Cheng, J. Yang, D. Gao, L. Guo and J. Han, "High-Quality Proposals for Weakly Supervised Object Detection," IEEE Transactions on Image Processing, vol. 29, pp. 5794-5804, 2020, doi: 10.1109/TIP.2020.2987161.
- [9] A. Shah, A. Kumar, A. Hauptmann, and B. Raj, “A closer look at weak label learning for audio events”, 2018, 10.13140/RG.2.2.29936.76806.
- [10] A. Dang, T. H. Vu, and J.-C. Wang, “Deep learning for DCASE2017 challenge”, Technical Report, Proceedings of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge, 2017.
- [11] J. Salamon, B. McFee, and P. Li, “DCASE 2017 submission: Multiple instance learning for sound event detection”, Technical Report, DCASE 2017 Challenge, 2017.
- [12] Q. Kong, et al., “Audio Set classification with attention model: A probabilistic perspective”, ICASSP, April 2018.
- [13] Y. Wang, J. Li, and F. Metze, “A comparison of five multiple instance learning pooling functions for sound event detection with weak labelling”, ICASSP, 2019, pp. 31–35.
- [14] B. McFee, J. Salamon, and J. P. Bello, “Adaptive pooling operators for weakly labelled sound event detection”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, pp. 2180–2193.
- [15] K. He, Y. Shen, and W. Zhang, “Hierarchical pooling structure for weakly labelled sound event detection”, Proceedings of Interspeech, 2019, pp. 3624–3628.
- [16] R. Serizel, N. Turpault, H. Eghbal-Zadeh, et al., “Large-scale weakly labelled semi-supervised sound event detection in domestic environments”, DCASE 2018 Workshop, 2018, pp. 19–23.
- [17] A. Mesaros, T. Heittola, A. Diment, et al., “DCASE 2017 challenge setup: Tasks, datasets and baseline system”, DCASE 2017 Workshop, 2017, pp. 85–92.
- [18] A. Mesaros, T. Heittola, T. Virtanen, et al., “Metrics for polyphonic sound event detection”, Applied Sciences, 2016, pp. 162–167.
- [19] J. K. Lu, “Mean teacher convolution system for DCASE 2018 Task 4”, Technical Report, DCASE 2018 Challenge, June 2018.
- [20] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results”, Advances in Neural Information Processing Systems, 2017, pp. 1195–1204.
- [21] L. Delphin-Poulat and C. Plapous, “Mean teacher with data augmentation for DCASE 2019 Task 4”, Technical Report, DCASE 2019 Challenge, June 2019.