DcaseNet: An integrated pretrained deep neural network for detecting and classifying acoustic scenes and events
Abstract
Although acoustic scenes and events involve many related tasks, their combined detection and classification have scarcely been investigated. We propose three deep neural network architectures that are integrated to simultaneously perform acoustic scene classification, audio tagging, and sound event detection. The first two architectures are inspired by human cognitive processes. The first architecture resembles the short-term scene perception of adults, who detect various sound events and then use them to identify the acoustic scene. The second architecture resembles the long-term learning of babies, which is also the concept underlying self-supervised learning: babies first observe the effects of abstract notions such as gravity and then learn specific tasks using these perceptions. The third architecture adds a few layers to the second one that solely perform a single task before its corresponding output layer. The aim is to build an integrated system that can serve as a pretrained model for the three abovementioned tasks. Experiments on three datasets demonstrate that the proposed architecture, called DcaseNet, can either be used directly for any of the tasks with suitable results or be fine-tuned to improve the performance of one task. The code and pretrained DcaseNet weights are available at https://github.com/Jungjee/DcaseNet.
Index Terms— Deep neural networks, acoustic scene classification, audio tagging, sound event detection
1 Introduction
Recent advances in deep learning have improved the performance of acoustic scene classification (ASC) and event-related systems in various applications [1, 2, 3, 4, 5, 6]. The detection and classification of acoustic scenes and events (DCASE) community holds annual challenges with public datasets [7, 8, 9]. The DCASE challenge datasets have fostered research on various tasks, including ASC, audio tagging (TAG), sound event detection (SED), bird audio detection, and sound localization.
However, these tasks have been studied independently using different deep neural networks (DNNs), even though their characteristics and the information they require are highly related. Few studies have explored DNN architectures that combine two tasks. Jung et al. [10, 11] applied an attention mechanism for ASC using a pretrained TAG system. The study was inspired by human cognition (e.g., [12]): humans first perform SED and leverage this information to classify scenes. For instance, perceiving car horns and traffic sounds can help one recognize that one is standing in a street. Imoto et al. [13, 14] explored the relation between ASC and SED, proposing DNNs that perform the two tasks simultaneously through a multi-task learning framework [15]. However, the integration of DNNs for related tasks remains at a preliminary stage because only pairs of related tasks have been investigated and the motivation behind the relationships between these tasks has not been explored. Hence, studies that address the motivation for using multiple tasks and analyze the relationships between them require further investigation.
We explore the integration of three tasks by proposing different DNN architectures (Fig. 1), of which two are inspired by human cognitive processes and the third extends the second. The integrated framework performs two segment-level tasks and one frame-level task using a single DNN that jointly learns and performs ASC, TAG, and SED (the three tasks are introduced in detail in Section 2). The first architecture resembles the short-term scene perception of adults, similar to the method used by Jung et al. [10]. Instead of using a TAG system to improve an ASC system, we propose an integrated framework that first performs SED followed by ASC and TAG.
The second architecture resembles the long-term learning of babies, which is also a motivation for self-supervised learning studies [16, 17, 18]. In self-supervised learning, pretraining a DNN for relatively abstract tasks and then fine-tuning it for a specific task is an effective approach. It mimics babies first acquiring abstract notions (e.g., the effects of gravity) and then learning specific tasks based on the corresponding perceptions. Likewise, we assume that segment-wise multiclass ASC might require a relatively low abstraction level compared with SED, which requires frame-wise multilabel binary classification. Thus, in the proposed architecture, we consider the abstraction level of each task and perform relatively coarse scene classification followed by specific SED. Comparative experiments demonstrate that the second architecture provides high performance, and thus we extend it using a few separate layers for each task.
The goal of this study is to build a single DNN that integrates ASC, TAG, and SED and can be fine-tuned to emphasize the performance of any of these tasks. The proposed DNNs, called DcaseNets, are intended to establish pretrained models for a wide range of acoustic tasks, like the ImageNet pretrained model [19]. The main contributions of this study are threefold:
1. Integrating DNN architectures, DcaseNets, to simultaneously perform ASC, TAG, and SED.
2. Developing DNN architectures based on human cognitive processes, namely, the scene perception of adults and the long-term learning of babies.
3. Demonstrating that fine-tuning the integrated DNNs for a specific task improves performance.

The remainder of this paper is organized as follows. In Section 2, we describe ASC, TAG, and SED and discuss early studies on their integration. In Section 3, we explain the proposed DNNs for setting integrated architectures to perform the three tasks. In Section 4, we report the experiments and results to verify the architecture performance. Finally, we draw conclusions in Section 5.
2 Related Work
ASC is a multiclass classification task that identifies a segment as one of the predefined scenes (i.e., classes). Acoustic scenes have an abstract (i.e., ambiguous) definition, and thus various characteristics may coincide across different scenes [20, 21]. For example, ‘airport’ and ‘shopping_mall’, which are both predefined scenes in the DCASE ASC challenge, may contain people talking and the acoustic properties of large indoor spaces. These shared characteristics lead to large intraclass variability, possibly degrading classification performance. In addition, recent studies on ASC have addressed the impact of different recording devices on classification [22, 23, 24].
TAG and SED are both multilabel binary classification tasks that perform event detection (i.e., judge the presence of various sounds). TAG conducts segment-level event detection and provides a vector in which each dimension is a real number between 0 and 1 indicating the probability that a sound event is present anywhere in the input segment. SED conducts frame-level event detection and provides a matrix in which each row indicates the presence of sound events in a frame. In effect, SED performs TAG while additionally providing the onset and offset of each sound event. Various deep learning studies have addressed frame-level classification followed by aggregation into segment-level classification. Likewise, we perform SED before TAG in the three proposed DNN architectures.
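To make the two output formats concrete, the following minimal sketch contrasts a frame-level SED output with a segment-level TAG output. The array shapes and the max-pooling aggregation are illustrative assumptions rather than the exact operations used in the proposed networks.

```python
import numpy as np

n_frames, n_events = 200, 14                          # illustrative sizes only

# SED output: one row per frame, one presence probability per event class.
sed_posteriors = np.random.rand(n_frames, n_events)   # (T, C), values in [0, 1]

# TAG output: a single probability per event class for the whole segment.
# One simple way to derive it from SED is max-pooling over time,
# i.e., "the event occurred somewhere in the segment".
tag_prediction = sed_posteriors.max(axis=0)            # (C,)

# Thresholding the frame-level posteriors yields the onset/offset
# information that the segment-level TAG vector discards.
sed_decisions = sed_posteriors > 0.5                    # (T, C) booleans
```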
Little research has been conducted on combining related sound classification tasks [13, 14, 10, 11]. Imoto et al. [13] assumed that ASC and SED are related and performed them simultaneously using a multitask learning framework. Jung et al. [10] aimed to mimic the human perception mechanism of leveraging TAG for ASC by applying an attention mechanism to the output of a pretrained TAG network. However, we assume that analyses and architecture designs accounting for the relations between tasks can further improve integrated systems.
3 Integrated Framework
We propose and investigate three DNN architectures to jointly perform ASC, TAG, and SED, as illustrated in Fig. 1. The architectures are called DcaseNet because they jointly detect and classify acoustic scenes and events. The design of the first and second architectures (Fig. 1-(a) and -(b)) is inspired by human cognitive processes. The first architecture resembles the perception mechanism of an adult, expanding the method in [10]. It first performs event detection (i.e., SED) using a convolutional recurrent neural network (CRNN) and then performs segment-level scene classification (i.e., ASC). Of the two event detection tasks, TAG and SED, we perform SED in the lower layers (closer to the input) of all three proposed architectures, assuming that detecting short events first improves detection throughout the audio segment.
The second architecture (Fig. 1-(b)) resembles the long-term learning of babies: abstract notions (e.g., gravity and dimension) are first acquired and then used to perform specific tasks (e.g., moving objects). Studies on self-supervised learning [16, 17, 18] also build on this process to pretrain DNNs with unlabeled data. Similarly, we consider the abstraction level required by each task representation. We assume that ASC requires a relatively lower abstraction level than both TAG and SED. Hence, the second architecture performs ASC after a few convolutional neural network (CNN) blocks and then performs TAG and SED after a gated recurrent unit (GRU) layer.
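A simplified PyTorch sketch of this shared-trunk layout with three task heads is given below. The pooling scheme, number of convolutional blocks, and head placement are compressed assumptions for illustration only; the layer sizes follow the configuration described in Section 4.2, and the exact topology is available in the GitHub repository.

```python
import torch
import torch.nn as nn

class DcaseNetSketch(nn.Module):
    """Shared CNN trunk -> ASC head; bidirectional GRU -> TAG and SED heads
    (roughly the layout of Fig. 1-(b)); everything beyond the stated sizes
    is simplified."""
    def __init__(self, n_mels=128, n_scenes=10, n_tags=80, n_events=14):
        super().__init__()
        # stand-in for the eight conv + batch-norm layers (last layer: 512 filters)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AvgPool2d((4, 1)),
            nn.Conv2d(128, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),           # collapse the mel axis
        )
        self.asc_head = nn.Linear(512, n_scenes)        # segment-level scenes
        self.gru = nn.GRU(512, 512, batch_first=True, bidirectional=True)
        self.tag_head = nn.Sequential(                  # 'Dense' block, 1,024 nodes
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, n_tags))
        self.sed_head = nn.Linear(1024, n_events)       # frame-level events

    def forward(self, x):                               # x: (B, 1, n_mels, T)
        h = self.cnn(x).squeeze(2).transpose(1, 2)      # (B, T, 512)
        asc = self.asc_head(h.mean(dim=1))              # (B, n_scenes)
        h, _ = self.gru(h)                              # (B, T, 1024)
        sed = self.sed_head(h)                          # (B, T, n_events)
        tag = self.tag_head(h.mean(dim=1))              # (B, n_tags)
        return asc, tag, sed
```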
As the two architectures are inspired by human cognition, we experimentally compare them and find that the second architecture tends to outperform the first. Thus, the third architecture (Fig. 1-(c)) extends the second with additional layers before each task. These separate layers are also concatenated and fed forward to subsequent layers. As in [25], we aim to maintain an information path (green layers in Fig. 1) while dedicating a few layers solely to an individual task. In addition, instead of performing TAG and SED in parallel, the hidden layers perform SED and TAG in sequence, assuming that TAG, a segment-level task, requires a relatively higher abstraction level than SED, a frame-level task. For the three architectures, we first perform single-task and joint training. The models trained on single tasks establish the baselines for the corresponding architectures. After joint training, we fine-tune each jointly trained DNN towards each task to determine the final performance on that target task.
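The branch-and-concatenate pattern of the third architecture (a few task-specific layers whose output both feeds that task's prediction and is appended back onto the shared path) can be sketched as follows. The layer sizes and module names are illustrative assumptions; the exact implementation is in the GitHub repository.

```python
import torch
import torch.nn as nn

class TaskBranch(nn.Module):
    """A small task-specific block: its output feeds that task's prediction
    head and is also concatenated back onto the shared information path."""
    def __init__(self, shared_dim, branch_dim, n_outputs):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(shared_dim, branch_dim), nn.ReLU())
        self.head = nn.Linear(branch_dim, n_outputs)

    def forward(self, shared):
        h = self.branch(shared)                   # task-specific representation
        prediction = self.head(h)                 # e.g., ASC logits
        shared = torch.cat([shared, h], dim=-1)   # information path is preserved
        return prediction, shared
```

Stacking such branches in sequence reproduces the ordering described above (SED before TAG) while keeping a common feed-forward path.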
Each component of DcaseNet is based on a high-performing architecture for the corresponding task. The CRNN (two light green boxes in Fig. 1) in the three DcaseNet architectures is the one used by Cao et al. [26], who achieved second place in Task 3 of the DCASE 2019 challenge with this network. The CRNN comprises eight convolutional layers with batch normalization [27], followed by a bidirectional GRU layer. The ‘CNN’ block at the top of Fig. 1-(a) is a residual block implemented in [11], which was submitted to Task 1-a of the DCASE 2020 challenge. The ‘Dense’ block before the TAG output in the three architectures is the one used by Akiyama et al. [28], who won Task 2 of the DCASE 2019 challenge; it comprises two fully-connected layers followed by dropout [29] and ReLU activation. The ‘Dense’ block before the SED output in Fig. 1-(b) also comprises two fully-connected layers with dropout and ReLU activation.
Table 1: Dataset specifications for each task.

| Task | ASC | TAG | SED |
|---|---|---|---|
| Train duration (hours) | 43.0 | 10.5 | 10.0 |
| # Train segments | 13,965 | 3,976 | 600 |
| Segment duration (seconds) | 10 | 0.3–30 | 60 |
| # Evaluation segments | 2,970 | 994 | 100 |
| # Classes | 10 | 80 | 14 |
4 Experiments and results
4.1 Datasets and metrics
We use the DCASE 2020 Task 1-a dataset for the ASC task, the DCASE 2019 Task 2 dataset for the TAG task, and the DCASE 2020 Task 3 dataset for the SED task. The dataset specifications, including duration, number of segments, and number of classes, are listed in Table 1. For TAG, we use only the curated training set, excluding the noisy set, and additionally hold out 20% of the data to report evaluation performance because the labels for the challenge evaluation set are not publicly available and a separate validation set does not exist. (The train/test split we used is available along with our code in the GitHub repository.) We resampled all segments to 24 kHz, 16-bit, monaural audio.
The performance of the proposed architectures is reported using four metrics: overall classification accuracy (Acc) for ASC, label-weighted label-ranking average precision (lwlrap) for TAG, and F-score (F1) and error rate (ER) for SED. Higher values indicate better results for all metrics except ER. For brevity, we omit detailed descriptions of each metric, which can be found on the DCASE website.
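Of these metrics, lwlrap is the least standard. The following minimal numpy sketch is written from the metric's public definition, not taken from the official scoring code, and illustrates how it weights every positive label (rather than every clip) equally.

```python
import numpy as np

def lwlrap(truth, scores):
    """Label-weighted label-ranking average precision (minimal sketch).

    truth:  (n_samples, n_classes) binary matrix of reference labels.
    scores: (n_samples, n_classes) predicted scores.
    """
    precisions = []                        # one value per positive (sample, label)
    for y, s in zip(truth, scores):
        pos = np.flatnonzero(y > 0)
        if pos.size == 0:
            continue
        order = np.argsort(-s)             # classes sorted by descending score
        rank = np.empty_like(order)
        rank[order] = np.arange(len(s))    # rank[c] = 0 for the top-scored class
        for c in pos:
            r = rank[c]
            hits = np.sum(rank[pos] <= r)  # positives ranked at or above class c
            precisions.append(hits / (r + 1))
    # averaging over all positive labels weights each label equally
    return float(np.mean(precisions))
```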
4.2 Experimental configurations
We use 128-dimensional Mel-spectrograms as input features to each DNN. The spectrograms are extracted using a 2,048-point fast Fourier transform and 40 ms windows with 20 ms overlap. To train on multiple datasets concurrently, we configure 500 iterations as one epoch and train for 160 epochs. The batch sizes for the ASC, TAG, and SED tasks are 32, 24, and 32, respectively. During training, segments are randomly cropped to 5 s for ASC and TAG and to 30 s for SED to construct the mini-batches and obtain a data augmentation effect. We use Adam optimization [30] with a learning rate of 0.001. The hyperparameters are kept unchanged for both joint training and fine-tuning towards each task.
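A minimal feature-extraction sketch consistent with this configuration is given below. The use of librosa for loading and Mel-spectrogram computation is our choice for illustration, not a detail stated in the paper.

```python
import numpy as np
import librosa

SR = 24000                 # all segments are resampled to 24 kHz, mono
N_FFT = 2048               # 2,048-point FFT
WIN = int(0.040 * SR)      # 40 ms window (960 samples)
HOP = int(0.020 * SR)      # 20 ms hop, i.e., 20 ms overlap between windows
N_MELS = 128               # 128 Mel bands

def extract_mel(path, crop_sec=None):
    """Load an audio segment and return a (128, n_frames) log-Mel spectrogram.
    `crop_sec` applies the random cropping used for mini-batch construction."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    if crop_sec is not None:
        n = int(crop_sec * SR)
        start = np.random.randint(0, max(1, len(y) - n + 1))
        y = y[start:start + n]
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, win_length=WIN, hop_length=HOP, n_mels=N_MELS)
    return librosa.power_to_db(mel)
```

For example, `extract_mel(path, crop_sec=5)` would produce a training input for the ASC and TAG tasks, and `crop_sec=30` for SED.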
The CRNN common to the three architectures comprises eight convolutional layers, each followed by batch normalization, with 512 output filters in the last convolutional layer, and a bidirectional GRU layer with 512 nodes. The ‘Dense’ block before the TAG output comprises two fully-connected layers with 1,024 nodes each. Other detailed configurations can be found in the GitHub repository of this study.
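The joint-training procedure can be summarized with the following hedged sketch of a single epoch: each of the 500 iterations draws one mini-batch per task from its own data loader and sums the task losses before one optimizer step. The unweighted loss sum, the dictionaries of loaders and criteria, and the recycling of exhausted loaders are assumptions made for illustration.

```python
import torch

def train_one_epoch(model, loaders, criteria, optimizer, steps_per_epoch=500):
    """One joint-training epoch over the ASC/TAG/SED data loaders.

    loaders:   {'asc': DataLoader, 'tag': DataLoader, 'sed': DataLoader}
               (batch sizes 32, 24, and 32, respectively)
    criteria:  {'asc': ..., 'tag': ..., 'sed': ...} task-specific loss functions
    optimizer: e.g., torch.optim.Adam(model.parameters(), lr=0.001)
    """
    model.train()
    iters = {task: iter(dl) for task, dl in loaders.items()}
    for _ in range(steps_per_epoch):
        optimizer.zero_grad()
        total_loss = 0.0
        for task in iters:
            try:
                x, y = next(iters[task])
            except StopIteration:                 # smaller datasets are recycled
                iters[task] = iter(loaders[task])
                x, y = next(iters[task])
            asc, tag, sed = model(x)              # as in the backbone sketch above
            pred = {'asc': asc, 'tag': tag, 'sed': sed}[task]
            total_loss = total_loss + criteria[task](pred, y)
        total_loss.backward()
        optimizer.step()
```

Fine-tuning towards a single task would then reuse the jointly trained weights and run the same loop with only that task's loader and loss.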
Table 2: Performance of each architecture trained on a single task.

| Architecture | ASC Acc (%) | TAG lwlrap | SED F1 (%) | SED ER |
|---|---|---|---|---|
| DCASE baseline | 54.10 | - | 60.60 | 0.5400 |
| Reference systems [11, 31] | 65.30 | - | 76.20 | 0.3050 |
| DcaseNet-v1 (Fig. 1-(a)), w/o Mix-up | 67.68 | 68.01 | 74.80 | 0.3486 |
| DcaseNet-v1 (Fig. 1-(a)), w/ Mix-up | 68.19 | 69.41 | 79.62 | 0.2926 |
| DcaseNet-v2 (Fig. 1-(b)) | 69.54 | 69.19 | 79.34 | 0.3085 |
| DcaseNet-v3 (Fig. 1-(c)) | 68.33 | 70.62 | 78.61 | 0.3085 |
Table 3: Performance after joint training (random initialization, ‘Joint’) and after fine-tuning towards each task from the jointly trained model (‘FT’). Checkmarks indicate the task combination used for joint training; Acc is reported for ASC, lwlrap for TAG, and F1/ER for SED.

| Architecture | #Param | ASC | TAG | SED | Joint Acc | Joint lwlrap | Joint F1 | Joint ER | FT Acc | FT lwlrap | FT F1 | FT ER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DcaseNet-v1 (Fig. 1-(a)), w/o Mix-up | 8.7M | ✓ | ✓ | | 59.23 | 53.55 | - | - | 67.18 | 68.81 | - | - |
| | 8.7M | ✓ | | ✓ | 57.04 | - | 64.75 | 0.4939 | 66.84 | - | 75.86 | 0.3365 |
| | 8.7M | | ✓ | ✓ | - | 71.91 | 74.66 | 0.3597 | - | 73.39 | 77.80 | 0.3153 |
| | 8.7M | ✓ | ✓ | ✓ | 56.13 | 59.25 | 59.82 | 0.5367 | 67.08 | 71.80 | 75.71 | 0.3470 |
| DcaseNet-v1 (Fig. 1-(a)), w/ Mix-up | 8.7M | ✓ | ✓ | | 55.19 | 59.53 | - | - | 67.08 | 70.56 | - | - |
| | 8.7M | ✓ | | ✓ | 60.27 | - | 63.42 | 0.5097 | 68.73 | - | 78.31 | 0.3079 |
| | 8.7M | | ✓ | ✓ | - | 75.18 | 75.27 | 0.3645 | - | 74.83 | 78.93 | 0.2963 |
| | 8.7M | ✓ | ✓ | ✓ | 57.01 | 65.22 | 66.99 | 0.4764 | 68.59 | 71.64 | 78.06 | 0.3238 |
| DcaseNet-v2 (Fig. 1-(b)) | 8.9M | ✓ | ✓ | | 58.92 | 60.00 | - | - | 69.87 | 69.68 | - | - |
| | 8.9M | ✓ | | ✓ | 58.79 | - | 69.53 | 0.4416 | 69.07 | - | 78.98 | 0.3058 |
| | 8.9M | | ✓ | ✓ | - | 74.54 | 76.46 | 0.3571 | - | 74.76 | 81.32 | 0.2826 |
| | 8.9M | ✓ | ✓ | ✓ | 59.80 | 62.94 | 66.67 | 0.4817 | 69.57 | 71.38 | 79.26 | 0.2968 |
| DcaseNet-v3 (Fig. 1-(c)) | 13.2M | ✓ | ✓ | | 61.08 | 65.03 | - | - | 70.35 | 71.42 | - | - |
| | 13.2M | ✓ | | ✓ | 59.70 | - | 75.62 | 0.3512 | 69.44 | - | 79.61 | 0.2948 |
| | 13.2M | | ✓ | ✓ | - | 76.23 | 77.93 | 0.3232 | - | 75.99 | 79.28 | 0.2958 |
| | 13.2M | ✓ | ✓ | ✓ | 56.80 | 70.40 | 75.19 | 0.3586 | 69.37 | 74.59 | 78.80 | 0.3185 |
4.3 Result analysis
Table 2 lists the performance of the three DcaseNet architectures when trained on a single task. The top two rows describe the results of the official DCASE baselines and of recent methods [11, 31], both submitted to the DCASE 2020 challenge. Note that reference performance for TAG cannot be reported because neither an official development nor an evaluation set exists. The proposed architectures show performance comparable to the recent methods, and their single-task results are therefore used as baselines in Table 3. In addition, the effect of Mix-up [32] is investigated using the DcaseNet-v1 architecture. The results demonstrate that Mix-up is effective, as it improves the task performance in most cases. Thus, we applied Mix-up to both the DcaseNet-v2 and DcaseNet-v3 architectures.
Table 3 lists the performance after joint training and fine-tuning the three proposed architectures. For each architecture, the baselines are those listed in Table 2. The task combinations for DNN training are indicated with checkmarks (e.g., the first row shows the results of the DNN trained using ASC and TAG). Column ‘Joint training’ lists the results of training the model using different task combinations. Column ‘Fine-tune’ lists the results of initializing the DNN using the jointly trained model and conducting fine-tuning for each task.
For DcaseNet-v1, which resembles an adult’s perception of scenes, the jointly trained model does not generalize well across the three tasks, even after fine-tuning. Joint training on the three tasks (fourth row) performs worse than the corresponding baseline in all four metrics. In addition, inconsistent results are obtained depending on whether Mix-up is applied: without Mix-up, the performance of TAG and SED improves, whereas with Mix-up, the performance of ASC and TAG improves. Even after fine-tuning, no single DNN that outperforms all three baselines can be obtained.
DcaseNet-v2, which resembles a baby’s learning process, outperforms DcaseNet-v1. After fine-tuning, DcaseNet-v2 consistently outperforms the corresponding baseline, except for the model jointly trained on ASC and SED. Comparing the DcaseNet-v1 and DcaseNet-v2 architectures, we conclude that considering the abstraction level according to the assumed task complexity is more effective than mimicking an adult’s perception mechanism. However, without fine-tuning, the jointly trained model (bottom row of DcaseNet-v2) performs worse overall than the corresponding baseline for each task.
DcaseNet-v3 achieves the highest performance among the three proposed architectures. The model jointly trained on the three tasks shows performance comparable to the corresponding baseline on each task. After fine-tuning for each task, DcaseNet-v3 outperforms the baseline except for the ER, which slightly increases from 0.3085 to 0.3185. In addition, joint training on ASC and TAG yields an accuracy of 70.35% after fine-tuning for ASC. Overall, the results demonstrate the effectiveness of architectures with a few layers dedicated solely to each task. This outcome is consistent with the observation in transfer learning that hidden layers must be fine-tuned even when the pretraining task is closely related to the target task.
Remarkably, for TAG, joint training with SED was the most effective approach across the three proposed architectures, possibly owing to the close relation between TAG and SED. In contrast, the degradation of ASC accuracy under joint training requires further investigation.
5 Conclusion and Future works
We propose three integrated DNN architectures, DcaseNets, inspired by human cognitive processes, to simultaneously perform ASC, TAG, and SED. The first architecture resembles the scene perception mechanism of adults, and the second resembles the long-term learning of babies, with the latter providing higher performance. The third architecture, which adds a few layers before the output of each task, further improves on the second. Jointly trained models can be further fine-tuned for each task. Experimental results show the high performance of the proposed DcaseNet-v3 on the three tasks after joint training, and it outperforms the corresponding baselines after fine-tuning.
However, as this is the first study to integrate the three related tasks, further investigation and improvement are required. Since we experimentally verified that DcaseNet-v2 outperforms DcaseNet-v1, we will adopt self-supervised learning to further improve task performance. In addition, we will apply knowledge distillation and leverage soft-label training. Soft labels will enable cross-dataset training by, for example, generating soft labels for the ASC dataset and using them to calculate the TAG loss.
References
- [1] K. Koutini, H. Eghbal-Zadeh, M. Dorfer, and G. Widmer, “The receptive field as a regularizer in deep convolutional neural networks for acoustic scene classification,” in European Signal Processing Conference (EUSIPCO). 2019, IEEE.
- [2] S. Mun, S. Park, D. Han, and H. Ko, “Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using SVM Hyper-Plane,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), 2017.
- [3] H. Chen, Z. Liu, Z. Liu, P. Zhang, and Y. Yan, “Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling,” Tech. Rep., DCASE2019 Challenge, 2019.
- [4] D. Barchiesi, D. D. Giannoulis, D. Stowell, and M. D. Plumbley, “Acoustic Scene Classification: Classifying environments from the sounds they produce,” IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 16–34, 5 2015.
- [5] V. Bisot, R. Serizel, S. Essid, and G. Richard, “Acoustic scene classification with matrix factorization for unsupervised feature learning,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2016, pp. 6445–6449, IEEE.
- [6] J.-w. Jung, H.-S. Heo, I.-H. Yang, S.-H. Yoon, H.-j. Shim, and H.-J. Yu, “DNN-based audio scene classification for DCASE 2017: dual input features, balancing cost, and stochastic data duplication,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), 2017.
- [7] A. Mesaros, T. Heittola, and T. Virtanen, “Acoustic scene classification: An overview of dcase 2017 challenge entries,” in International Workshop on Acoustic Signal Enhancement (IWAENC). 11 2018, pp. 411–415, IEEE.
- [8] M. D. Plumbley, C. Kroos, J. P. Bello, G. Richard, and D. P. Ellis, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE), Tampere University of Technology, 2018.
- [9] M. Mandel, J. Salamon, and D. P. Ellis, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE), New York University, 2019.
- [10] J.-w. Jung, H.-j. Shim, J.-h. Kim, S.-b. Kim, and H.-J. Yu, “Acoustic Scene Classification using Audio Tagging,” in Annual Conference of the ISCA, INTERSPEECH, 2020.
- [11] J.-h. Kim, J.-w. Jung, H.-j. Shim, and H.-J. Yu, “Audio Tag Representation Guided Dual Attention Network for Acoustic Scene Classification,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), 2020.
- [12] C. Guastavino, “Categorization of environmental sounds,” Canadian Journal of Experimental Psychology, vol. 61, no. 1, p. 54, 2007.
- [13] K. Imoto, N. Tonami, Y. Koizumi, M. Yasuda, R. Yamanishi, and Y. Yamashita, “Sound Event Detection by Multitask Learning of Sound Events and Scenes with Soft Scene Labels,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020, pp. 621–625, IEEE.
- [14] N. Tonami, K. Imoto, M. Niitsuma, R. Yamanishi, and Y. Yamashita, “Joint analysis of acoustic events and scenes based on multitask learning,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. 10 2019, pp. 338–342, IEEE.
- [15] R. Caruana, “Multitask Learning,” Machine Learning, vol. 28, pp. 41–75, 1997.
- [16] Y. Bengio, G. E. Hinton, and Y. LeCun, “AAAI 2020 invited speaker program,” 2020, https://aaai.org/Conferences/AAAI-20/invited-speakers/.
- [17] L. Jing and Y. Tian, “Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–22, 5 2020.
- [18] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2015, pp. 1422–1430.
- [19] J. Deng, W. Dong, R. Socher, L.-J. Li, Kai Li, and Li Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 3 2009, pp. 248–255, IEEE.
- [20] J.-w. Jung, H.-S. Heo, H.-j. Shim, and H.-J. Yu, “Distilling the Knowledge of Specialist Deep Neural Networks in Acoustic Scene Classification,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), 2019, pp. 114–118.
- [21] J.-w. Jung, H.-S. Heo, H.-j. Shim, and H.-J. Yu, “Knowledge Distillation in Acoustic Scene Classification,” IEEE Access, vol. 8, pp. 166870–166879, 9 2020.
- [22] P. Primus and D. Eitelsebner, “Acoustic Scene Classification with Mismatched Recording Devices,” Tech. Rep., DCASE2019 Challenge, 2019.
- [23] S. Gharib, K. Drossos, E. Cakir, D. Serdyuk, and T. Virtanen, “Unsupervised Adversarial Domain Adaptation for Acoustic Scene Classification,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), 2018.
- [24] M. Kosmider, “Calibrating neural networks for secondary recording devices,” Tech. Rep., DCASE2019 Challenge, 2019.
- [25] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Lecture Notes in Computer Science. 2016, vol. 9908 LNCS, pp. 630–645, Springer Verlag.
- [26] Y. Cao, Q. Kong, T. Iqbal, F. An, W. Wang, and M. D. Plumbley, “Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE). 2019, pp. 30–34, New York University.
- [27] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning (ICML). 2015, vol. 1, pp. 448–456, IMLS.
- [28] O. Akiyama and J. Sato, “DCASE 2019 Task 2: Multitask Learning, Semi-supervised Learning and Model Ensemble with Noisy Data for Audio Tagging,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), 2019, pp. 25–29.
- [29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
- [30] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
- [31] T. N. T. Nguyen, D. L. Jones, and G. W. Seng, “DCASE 2020 Task 3: Ensemble of Sequence Matching Networks for Dynamic Sound Event Localization, Detection, and Tracking,” Tech. Rep., DCASE2020 Challenge, 2020.
- [32] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “MixUp: Beyond empirical risk minimization,” in International Conference on Learning Representations (ICLR), 2018.