Curriculum Audiovisual Learning

Di Hu Big Data Lab, Baidu Research Zheng Wang School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected] Haoyi Xiong Big Data Lab, Baidu Research Dong Wang School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected] Feiping Nie School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected] Dejing Dou Big Data Lab, Baidu Research

Abstract

Associating sound and its producer in complex audiovisual scene is a challenging task, especially when we are lack of annotated training data. In this paper, we present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector, and regards the pervasive property of audiovisual concurrency as the latent supervision for inferring the correlation among detected contents. To ease the difficulty of audiovisual learning, we propose a novel curriculum learning strategy that trains the model from simple to complex scene. We show that such ordered learning procedure rewards the model the merits of easy training and fast convergence. Meanwhile, our audiovisual model can also provide effective unimodal representation and cross-modal alignment performance. We further deploy the well-trained model into practical audiovisual sound localization and separation task. We show that our localization model significantly outperforms existing methods, based on which we show comparable performance in sound separation without referring external visual supervision. Our video demo can be found at https://youtu.be/kuClfGG0cFU.

1 Introduction

Audiovisual concurrency provides potential cues for perceiving and understanding the outside world [16]. Such concurrency comes from the simple phenomena of “Sound is produced by the oscillation of object” [17], and exists through our daily life, such as the talking crowd, the barking dog, the roaring machine, etc. These inherent and pervasive correspondences provide us the reference to distinguish and correlate different audiovisual messages, then contribute to learning diversified visual appearances from their produced sounds, or perceiving various acoustic signals from their diversified sound-makers, i.e., we can visually localize the source position by hearing the sound or separate sound from chaotic acoustic scene with visual guidance.

Refer to caption — Figure 1: An illustration of the heterogeneous audiovisual (AV) scene complexity. The simple scene contains only one sound-source, while the complex scene contains multiple sound-sources.

As expected, machine intelligence should also take advantage of the inherent audiovisual concurrency for possessing human-like audiovisual processing ability. In recent years, some pioneering works have been developed to address this challenge, evolving from cross-modal knowledge transfer [4, 24] to directly determining whether audiovisual messages are corresponding or not [1, 17, 20]. However, the learning capacity of these models is pervasively limited by the heterogeneous complexity of audiovisual scene, i.e., arbitrary number of sound-sources, as shown in Fig. 1. On the one hand, it is easy to align sound and its visual source in the simple scene with single sound, whereas more difficult for the complex one with multiple sounds as lack of one-to-one audiovisual alignment annotations. On the other hand, most of these existing models indiscriminately utilize these simple and complex audiovisual data, which could confuse the models when analyzing and aligning diversified audiovisual content without auxiliary annotations. Hence, we argue that differentiated audiovisual learning w.r.t. their heterogeneous complexity should be explored for achieving robust audiovisual perception.

Further, effective audiovisual learning can reward the ability of achieving interesting cross-modal perception tasks, i.e., audiovisual sound localization and separation [23]. Recent approaches that focus on these tasks have shown considerable performance [2, 12, 26, 31]. However, these methods usually focus on simple audiovisual scene and cannot derive concrete representations for specified sound-maker (i.e., object). Moreover, in the audiovisual sound separation task, these works normally consider that external visual knowledge from supervised training is necessary for effective separation guidance [31, 12, 30], but we argue that it should be an unnecessary procedure.

In this paper, we strive to make a step towards to the goal of human-like audiovisual learning ability. The primary challenge is how to distinguish and align different sounds and sound-makers just with the scene-level consistency, especially when faced with the heterogeneous audiovisual complexity. The second challenge is how to derive effective visual representations for various sound-makers without referring external knowledge, then server them to audiovisual sound localization and separation tasks.

To address these two challenges, we propose to grade the heterogeneous scene into a set of audiovisual curriculum in different difficulty levels and perform differentiated audiovisual learning, from the easy ones to the hard ones. The core insight is that we can easily analyze and align the audiovisual content in the simple scenes with single sound, meanwhile it can also provide prior alignment knowledge for the learning in complex scenes. As the audiovisual alignment is inferred by grouping and comparing the distinctive channel responses of both modalities, we can further use the aligned visual representations of sound-maker for audiovisual sound localization and separation. Concretely, our contributions are threefold:

•

We develop a flexible audiovisual learning model that derives effective unimodal representations and infers the latent alignment between sound and sound-maker for simple and complex scene. This model imposes a soft-clustering module as the pattern detector and correlates the clustered patterns via a structured alignment objective in a space shared by both modalities.
•

We propose a novel Curriculum AudioVisual Learning (CAVL) strategy, where the difficulty level is determined by the number of sound-source in the scene. Amounts of experimental results show that such simple learning strategy not only makes our model much easier to train, but also improves the learning and alignment performance of audiovisual contents. Besides, we also develop a counting model for estimating the audiovisual scene complexity.

•

We further deploy the learned audiovisual model in cross-modal perception tasks. In audiovisual sound localization, our model shows considerable improvements over existing models. Moreover, it can provide effective visual representation for sound separation, based on which our approach has comparable performance to the methods that utilize external visual knowledge from supervised pre-training.

2 Related Work

Audiovisual self-supervised learning As the audiovisual concurrency is inherently free of human-annotation and can provide correspondence signal of two modalities, it has drawn great attention in the self-supervised learning community [23], where self-supervised learning means training model from the input data itself and without human-annotation. In the early stage, the research group in MIT firstly regards the audiovisual concurrency as the connection of cross-modal knowledge transfer, where the student-network of one modality is supervised with the prediction of teacher-network of the other modality [4, 24]. Recently, a novel learning criterion is proposed to directly model the audiovisual concurrency without teacher supervision [1, 20], i.e., the audiovisual network learns to predict the sound and image from the same video or not. It is surprising that, with such simple supervision, both modality networks can learn to response to specific visual appearance and acoustic message [1, 20]. But on the other hand, as the scene-level consistency is lack of specific annotations between audiovisual components, it usually works well in the simple scenes with single sound but suffers from the defects of inefficiency and inaccuracy in the complex scene [17]. Compared with these approaches, our method can learn to align multiple audiovisual components, even faced with complex audiovisual scene.

Sound localization in visual modality Visually localizing the sound-maker is a typical audiovisual perception task. Existing approaches address this challenge mainly by correlating pixel and sound based on their concurrency, where canonical correlation analysis [18], embedded scalar products [2], attention mechanism [26], and class activation map [23] have been proposed for effectively identifying the pixels of sound. Although these methods have shown promising visualization outcome, we consider that localization results should be more than that. We can also learn effective visual representation of sound-source for further perception of vision and sound, which is expected but ignored previously.

Audiovisual sound separation The visual information of sound-maker are considered to provide effective reference for separating the corresponding sound from complex scene [3]. Motivated by this, a number of approaches have been developed for achieving robust audiovisual sound separation in different types of sound, which range from speech separation [10, 23], music separation [12, 30, 29] to object sound separation [11, 17]. To achieve considerable performance in realistic environment, these methods resort to the reliable visual representation of sound-maker, which are obtained from ImageNet pre-trained visual network [11, 31, 29] or off-the-shelf object detector [10, 12], then correlate them with the sound embeddings in a common space. Compared to these methods, our approach does not need to refer external visual knowledge trained with human annotations.

Curriculum learning Compared with a disordered arrangement, starting from easier samples or tasks then gradually increasing the difficulty level could contribute to better learning performance [9, 6], which is called curriculum learning and has been widely applied in image classification [14], natural language processing [8], speech recognition [7]. Different from conventional supervised learning, self-supervised learning is lack of effective human-annotation, which is more vulnerable to training order [22]. Recently, Korbar et al. [20] employed curriculum learning for choosing the negative samples of audiovisual temporal synchronization. However, there is few work focusing on how to improve the audiovisual learning performance with curriculum learning strategy, when faced with heterogeneous scene complexity.

3 Approach

3.1 Audiovisual Learning Model

Given synchronized audio and visual messages¹¹1We use sound spectrogram and image for representing audio and visual message, respectively., i.e., $\{{a_{1}},{a_{2}},...,{a_{n}}\}$ and $\{{v_{1}},{v_{2}},...,{v_{n}}\}$ , which are separated from unlabelled videos $\{\mathcal{V}_{1},\mathcal{V}_{2},...,\mathcal{V}_{n}\}$ , we target to effectively train an audiovisual network from cold-start and make it possess the ability of generating robust unimodal representation and performing effective cross-modal perception. The whole framework is shown in Fig. 2.

3.1.1 Learning representation via clustering

As the filters of convolution networks have shown the property of class-relevant activation [32, 17], it becomes feasible to discover and disentangle the audio and visual components by analyzing and integrating their channel representations. Concretely, we first employ convolution networks to model each modality then embed the data into feature maps with size $H\times W\times C$ , where $H$ and $W$ are the frame size and $C$ is the number of channels. Then, these feature maps are reshaped into a set of vector $\{{x_{1}},{x_{2}},...,{x_{H\times W}}|x_{i}\in R^{C}\}$ . Due to the distinct activation of different modal components, some of the reshaped feature vectors in the $C$ -dimension channel space should take similar distribution when they describe the same modal components but dissimilar for different components. Hence, we propose to integrate these feature vectors by performing soft K-means clustering [5] in the channel space for each modality.

Formally, the objective function of soft K-means clustering [5] can be formulated as

J=\sum\limits_{i=1}^{H\times W}{\sum\limits_{j=1}^{k}{{w_{ij}}d\left({{x_{i}},{c_{j}}}\right)}},

(1)

where $c_{j}\in R^{C},j=1,2,...,k$ is the $j-$ cluster center, $k$ is the number of sound-sources in the audiovisual scene, $d\left({{x_{i}},{c_{j}}}\right)$ is the Euclidean distance between current feature point $x_{i}$ and center $c_{j}$ for measuring their similarity. $w_{ij}\in[0,1]$ is the indicator variable, which indicates the degree of assignments and can be achieved by performing softmax function over the distance of $d\left({{x_{i}},{c_{j}}}\right)$ , i.e.,

{w_{ij}}=\frac{{{e^{-\beta d\left({{x_{i}},{c_{j}}}\right)}}}}{{\sum\limits_{l=1}^{k}{{e^{-\beta d\left({{x_{i}},{c_{l}}}\right)}}}}},

(2)

where the hype-parameter $\beta>0$ is called stiffness parameter and controls the scalability of assignments.

Eq. 1 is a minimization problem about two parts, i.e., the assignments and centers. The Expectation-Maximization algorithm can be employed for solving it effectively [21]. In the E-step, we fix the cluster centers $C$ and update the assignment $w_{ij}$ via Eq. 2 . In the M-step, we fix the assignment $w_{ij}$ and re-compute the centers with the updated assignments from E-step, i.e.,

{c_{j}}={{\sum\limits_{i=1}^{H\times W}{{w_{ij}}{x_{i}}}}\mathord{\left/{\vphantom{{\sum\limits_{i=1}^{H\times W}{{w_{ij}}{x_{i}}}}{\sum\limits_{i=1}^{H\times W}{{w_{ij}}}}}}\right.\kern-1.2pt}{\sum\limits_{i=1}^{H\times W}{{w_{ij}}}}}.

(3)

By alternatively executing E- and M-step, we aim to find $k$ centers, each of which should correspond to certain modal component, such as specific object or sound. Meanwhile, the corresponding cluster assignment can be interpreted as a spatial-mask over feature map and indicates the location of sound-source in both modalities, as shown in Fig. 2.

3.1.2 Audiovisual alignment objective

For a given audiovisual scene, although the contained sounds and objects have been described as different clustering centers, it is still difficult to directly perform alignment between them only with supervision at the entire scene level. Fortunately, the pervasive concurrency of sound and sound-maker can help to infer the latent alignment by comparing the matching degree of different sound-object pairs, where the valid pair should possess higher matching degree. Concretely, for each audio clustering center $c_{i}^{a}$ , we aim to find proper visual clustering center (i.e., sound-maker) based on their concurrency, which can be formulated as

{S_{{a}{v}}}=\sum\limits_{i=1}^{{k_{a}}}{\mathop{\min}\limits_{j\in\{1,2,...,{k_{v}}\}}{{\left\|{c_{i}^{{a}}-c_{j}^{{v}}}\right\|}_{2}}},

(4)

where $k_{a}$ and $k_{v}$ are the number of audio center and visual centers, respectively²²2As the visual background is usually irrelevant with sound, we set $k_{v}=k_{a}+1$ and use additional visual center to represent it.. By minimizing Eq. 4, each audio center is aligned to the nearest visual center, meanwhile we can also derive the similarity score of sounds and objects in the entire scene, i.e., $S_{av}$ .

For an arbitrary scene, the self-supervision signal only confirms whether the audio and visual information are from the same scene (video) or not. To effectively leverage such supervision, we employ the contrastive loss to train the audiovisual network and infer the latent alignment simultaneously, which has shown the property of consistency and robust in two-stream network optimization [19, 20]. Concretely, the contrastive loss is written as

	$\displaystyle L_{av}=\frac{1}{{2n}}$	$\displaystyle\sum\limits_{i,j=1}^{n}({\delta_{i=j}}S_{{a_{i}}{v_{j}}}^{2}+$		(5)
		$\displaystyle(1-{\delta_{i=j}})\max{{(margin-{S_{{a_{i}}{v_{j}}}},0)}^{2}})$		(5)

where $a_{i}$ and $v_{j}$ stand for the sound and image from scene $i$ and $j$ , respectively. $\delta_{i=j}$ is an indicator of each sound-image pair, i.e., $\delta_{i=j}=1$ if $i=j$ , otherwise $0$ . In practice, the negative samples of $i\neq j$ are randomly sampled from the training set. Generally, Eq. 5 encourages the audiovisual network to have higher matching confidence for the aligned sound-image pair than the mismatched ones by introducing the hyper-parameter of $margin$ .

3.2 Curriculum Learning

3.2.1 Curriculum Procedure

Usually, the audiovisual scenes in the wild contain different amounts of sound-sources, we find that directly performing audiovisual learning with these data will make the model very difficult to optimize and also lower the alignment performance. To settle this problem, we propose to train the audiovisual model step by step, which is about starting from simple scene then gradually increasing the difficulty level, where the number of sound-sources is considered as the reference for audiovisual scene complexity. Intuitively, for the simple scene with single sound-source, it is easy to visually localize the sound-maker from background then align it to the unique sound, such as the example of accordion in Fig. 1. By contrast, we find that if training with complex audiovisual scenes (e.g., with three sound-sources) from the beginning, the model will get much lower convergence speed and worse results. While, model trained with simple scenes can contribute prior knowledge for distinguishing different sound-makers and sounds, and also provides the reference for alignment . Therefore, the audiovisual learning model can be further optimized with the complex scene, which leads to better results.

In practice, to effectively perform curriculum learning, all the audiovisual data have been sorted from simple to complex before training, according to the number of sound-sources in the scene. For different learning stages, the cluster number is accordingly set to the number of sources, e.g., $k_{a}=1$ and $k_{v}=2$ for the scene with single source. Based on these graded audiovisual data, we can train the audiovisual learning model in a curriculum fashion.

3.2.2 Complexity Estimation

Since the audiovisual scene complexity is crucial for curriculum training, it is worth to learn to model and estimate the number of sound-source in a given scene. Formally, the discrete probability distribution of Poisson for counting data $y_{i}$ is given by $P({Y_{i}}={y_{i}})=\frac{{{e^{-{\lambda_{i}}}}\lambda_{i}^{{y_{i}}}}}{{{y_{i}}!}},{y_{i}}=1,2,...,$ where $\lambda_{i}$ is interpreted as the expected number of events in the interval and ${y_{i}}!$ is the factorial of $y_{i}$ . In this task, $y_{i}$ is performed as the number of sound source in the audiovisual scene. Consequently, we propose to model $\lambda_{i}$ as a function of the input sound $a_{i}$ by the audio network, which is written as ${\lambda_{i}}=f(a_{i})$ . The function $f(\cdot)$ means the counting network of sound-sources. By taking the negative log-likelihood w.r.t. the Poisson distribution, we can have the Poisson regression loss

{L_{p}}=\sum\limits_{i=1}^{n}{(f(a_{i})-{y_{i}}\log f(a_{i}))+{\rm{ln}}({y_{i}}!)},

(6)

where the term of ${\rm{ln}}({y_{i}}!)$ can be ignored, as it is a constant to the model training. After training the counting network, we can estimate the scene complexity by identifying the number that holds the maximal probability.

3.3 Audiovisual Perception

3.3.1 Localizing sounds in visual modality

Considering that the audiovisual learning model has learned to align objects and sounds in the training phase, we can directly identify the potential object which produces given sound by comparing their similarity, i.e.,

c_{source}^{v}=\mathop{\arg\min}\limits_{j=1,2,...,{k_{v}}}{\left\|{c_{source}^{a}-c_{j}^{v}}\right\|_{2}}.

(7)

For the clustering center of sound-sources $c_{source}^{a}$ , we compare it with all the visual centers $\{{c_{1}^{v}},{c_{1}^{v}},...,{c_{k_{v}}^{v}}\}$ and select the closest one as visual representation of sound source. As the corresponding assignment $w_{\cdot source}$ indicates the correlation between all the visual feature vectors and $c_{source}^{v}$ ,we can reshape it back to the size of $H\times W$ and regard it as the location mask of sound-maker to achieve visual localization. To better visualize the object position, we can also further resize the assignment to the size of input image.

3.3.2 Audiovisual sound separation

To validate the effectiveness of inferred object representation further, we propose to perform sound separation based on visual guidance. The representative audiovisual separation network in [10, 12] is adopted, as shown in Fig. 6. Concretely, the separation network takes the visual clustering center (i.e., $c_{i-source}^{v}$ ) as the sound-maker representation in $i-$ scene, and targets to separate its produced sound from the mixed audio signal. Alternatively, we can also use the assignment $w_{\cdot source}$ to point out the location of sound-maker, and regard it as the object mask over the visual feature maps. Then, a sound-maker-awareness max-pooling can be performed over the masked feature maps to obtain robust object representation.

A variant of U-Net [25] is used to perform sound-source separation, similar to [12, 31]. The network takes the spectrogram of $i-$ mixed sound $a_{i}^{mix}$ as the input, then encodes it into audio feature maps via stacked convolution layers. The replication and tiling operation is performed over the visual representation to match the size of embedded audio feature maps. Then we concatenate these two modalities, and feed them into stacked up-convolution layer to generate a spectrogram mask. The separation loss is written as

{L_{s}}=\sum\limits_{i=1}^{n}{{{\left\|{{M_{i}}-g\left({{{a}_{i}^{mix}},c_{i-source}^{v}}\right)}\right\|}_{1}}},

(8)

where $g(\cdot)$ means the audiovisual sound separation network and $M_{i}$ is the mask of the spectrogram magnitudes of the target sound $a_{i}$ and mixed sound ${{a}_{i}^{mix}}$ , i.e., ${M_{i}}={{{m^{{a_{i}}}}}\mathord{\left/{\vphantom{{{m^{{a_{i}}}}}{{m^{a_{i}^{mix}}}}}}\right.\kern-1.2pt}{{m^{a_{i}^{mix}}}}}$ . With the masked spectrogram, we can use Inversed Short-Time Fourier Transform (ISTFT) to produce separated sound signal w.r.t. specific sound-maker.

4 Experiments

4.1 Datasets

AudioSet-Balanced Audioset[13] is an audio event dataset, which consists of 2,084,320 human-annotated 10-second video clips. These clips are collected from YouTube, therefore many of which are in poor-quality and contain multiple sound-sources. A hierarchical ontology of 632 event classes is employed to annotate these data, which means that the same sound could be annotated as different labels. For example, the sound of barking is annotated as Animal, Pets, and Dog. Such hierarchical annotation makes it extremely difficult to precisely estimate the number of sound-sources in the clip. Hence, we propose to filter the original annotation by just keeping the third-level labels³³3Detailed filtering process can be found in the supplemental materials., e.g., Dog. In the original setting, all the videos are splitted into Evaluation/Balanced-Train/Unbalanced-Train set. For efficiency purpose, we only use the Balanced-Train set for training. These data are divided into different curriculums according to the number of contained sound-sources, e.g., the first curriculum $\mathcal{C}_{1}$ consists of videos with single sound-source. Finally, all the 19,443 valid video clips are divided into 9,239/7,098/2,685/421 $\mathcal{C}_{1}$ / $\mathcal{C}_{2}$ / $\mathcal{C}_{3}$ / $\mathcal{C}_{4}$ curriculum clips. For each curriculum set, the input audio is a 10s mono sound and the image is randomly selected from video. Note that, all the semantic labels are not used during training.

MUSIC The MIT MUSIC dataset contains 685 videos, with 536 musical solo and 149 duet videos. These videos contain 11 instrument categories. They are also collected from YouTube, but cleaner than the ones in AudioSet. Hence, they are more proper for the sound separation task. As the duet videos do not have ground-truths of sounds in mixtures, we only use the solo videos for training and testing. Following [12], the first and second video of each instrument category is selected for validating and testing, respectively. And the rest ones in solo are for training. Note that, some videos have been removed by the YouTube users, the final training data are about 467 videos. All the videos are randomly splitted into 10s clips.

4.2 Network and Implementation Details

Our audiovisual learning network is a two-stream network, where the off-the-shelf VGGish network [15] is employed for sound and VGG16 [27] is for vision. For each modality, the feature maps are the outputs of final convolution layer of the network. Detailed architecture description for both audiovisual learning model and separation model can be found in the supplemental materials.

For all the experiments, the input audio of 10s long is represented in log magnitude spectrogram of $512\times 432$ , which is achieved by STFT (with the window size of 1022 and hop length of 256) and log-frequency projection. For the visual modality, we directly reshape the input image into $256\times 256\times 3$ . The network is trained with Adam optimizer, where the starting learning rate is $10^{-4}$ for the first curriculum, then gradually times $10^{-1}$ for the next one. For example, the learning rate for the third curriculum is $10^{-6}$ .

4.3 Curriculum Learning Evaluation

4.3.1 Learning comparison

In this section, we aim to have an insight into how the curriculum strategy influences the audiovisual learning performance. Concretely, to evaluate the effects of different audiovisual complexities, the original set and the curriculum set of $\mathcal{C}_{1}$ and $\mathcal{C}_{3}$ are selected for training the audiovisual network, respectively. As shown in Fig. 4, it is obvious that the network trained with the simple curriculum of $\mathcal{C}_{1}$ enjoys the fastest convergence and lowest training loss, while the one trained with $\mathcal{C}_{3}$ suffers from the worst performance. Such phenomena indicate that the model performance is significantly affected by the audiovisual complexity, and the simple scene can provide better learning performance.

Further, we want to know what the model can benefit from pre-curriculum, i.e., the effects of curriculum initialization. In Fig. 4, we show the training accuracy of audiovisual model on the $\mathcal{C}_{3}$ set, which are initialized from random and the model trained with the $\mathcal{C}_{2}$ set, respectively. Surprisingly, the model initialized from $\mathcal{C}_{2}$ enjoys the great advantages compared with the random one. Curriculum learning indeed helps to accelerate and improve the audiovisual learning performance.

4.3.2 Acoustic Scene Classification

The unimodal representation learned by audiovisual model is also influenced by the curriculum strategy. To assess such influence, we propose to perform acoustic scene classification by viewing the trained audiovisual model as a feature extractor. The ESC-50 dataset is chosen for evaluation and we follow the same pre-processing and train/test split as [17]. Table 1 show the comparison results, where the sound-source-level alignment is Eq. 5 and the scene-level alignment is directly comparing the audio and visual representation without clustering, similar to [1]. In Table 1, we can summarize these results into three points. First, learning with simple curriculum can provide more proper initialization. Second, sound-source matching better utilizes the audiovisual concurrency than scene-level matching, especially in the complex scene. Third, direct video-level matching in the complex scene may deteriorate the pre-trained network. This is probably because the chaotic audiovisual correlation could confuse the scene matching objective, but it was ignored before.

Training Strategy	Accuracy $\uparrow$
$\mathcal{C}_{1}$	51.25
$\mathcal{C}_{2}$ + Sound-source-level alignment	56.75
$\mathcal{C}_{2}$ + Scene-level alignment	47.25

Table 1: Acoustic scene classification result on ESC-50.

4.3.3 Poisson Regression

In this section, we evaluate the performance of audiovisual complexity estimation. Table 2 shows the Poisson regression results when training on AudioSet-Balanced-Train and testing on AudioSet-Evaluation, where the number of sound source ranges from 1 to 5. Compared to the chance results, the audio Poisson regression network has a great superiority in both accuracy and Mean Average Error (MAE). Moreover, the results can be further improved by adopting the network pre-trained with audiovisual learning objective, i.e., Eq. 5. Intuitively, if the network has been trained with higher level curriculum, e.g., $\mathcal{C}_{2}$ , it will better estimate the number of sound-sources in the audio modality. Such evidences show that complex audiovisual data can be effectively modeled by self-supervised learning method, especially when adopting the curriculum learning strategy.

Approach	Accuracy $\uparrow$	MAE $\downarrow$
Random	20.2	1.624
w/o Pre-train	45.2	0.742
CAVL- $\mathcal{C}_{0}$	47.0	0.701
CAVL- $\mathcal{C}_{1}$	49.6	0.616
CAVL- $\mathcal{C}_{2}$	50.9	0.614

Table 2: Poisson regression results. CAVL-

\mathcal{C}_{0}

is a preliminary curriculum, in which the model is trained with scene-level alignment on the

\mathcal{C}_{1}

dataset.

4.4 Audiovisual Sound Localization

In this task, we aim to visualize the object location where the sound is produced. The AudioSet-Balanced-Train dataset is adopted for training, and the human-annotated SoundNet-Flickr [4, 26] dataset is used for testing. As the training and testing datasets come from different sources, it is more challenging to perform exact localization.

Fig. 5 shows some qualitative examples w.r.t. sound-source locations. Compared with the human-annotations, our model can predict proper visual localization for the corresponding sound. And we can find that the annotated bounding boxes are sometimes too rough to provide exact source locations, while our model can address this challenge due to the clustering advantage of spatial assignments.

We further perform quantitative evaluation. Following [26], the same 250 audiovisual pairs are selected from the annotated SoundNet-Flickr dataset for evaluation, and consensus Intersection over Union (cIoU) and AUC area [26] are used as evaluation metrics. To evaluate the effectiveness of curriculum learning, our models trained in different curriculum levels are also considered, as shown in Fig. 3. First, our models outperform all the other methods by a large margin. It demonstrates that our model can better capture and align different sound-sources, even faced with multi-source scenes. Second, besides the aligned visual center, we also evaluate the unaligned visual center. As expected, they suffer from a large decline in both metrics, which indicates that our model can exactly distinguish sound-maker from background and align it with the produced sound. Third, our model trained with curriculum $\mathcal{C}_{2}$ is worse than the one with $\mathcal{C}_{1}$ . This is because the test videos are all single-source, the multi-source videos in $\mathcal{C}_{2}$ may mix up the alignment knowledge learned in $\mathcal{C}_{1}$ .

Methods	[email protected] $\uparrow$	AUC $\uparrow$
Random	12.0	32.3
Attention[26]	43.6	44.9
DMC[17]	41.6	45.2
CAVL- $\mathcal{C}_{1}$	50.0	49.2
CAVL- $\mathcal{C}_{1}$ (unrelated)	19.2	36.8
CAVL- $\mathcal{C}_{2}$	46.0	45.7

Table 3: Quantitative localization results on SoundNet-Flickr dataset [26]. AUC is the area under the cIoU curve.

4.5 Sound Separation

We evaluate the audiovisual sound separation performance on the MIT-MUSIC dataset and more results on AudioSet are in the supplemental materials. In this task, effective separation depends on the quality of visual representation of sound-maker. To address this challenge, most of existing methods resort to the ImageNet pre-trained or fine-tuned sound-maker detector [11, 12, 30, 31]. In contrast, we use the sound localization technique to automatically extract the visual representation of sound-maker, which is implemented with audiovisual alignment objective (i.e., Eq. 5) without any human-supervision. Fig. 7 shows some examples of solo and duet scene. It is obvious that our model can localize most instruments, especially the different instruments in the duet scene. Then, we can use the corresponding visual centers or masked visual features as the representation of sound-maker for sound separation, as introduced in Sec.3.3.2. Note that, as the extraction of visual representation does not use any extra human-annotation, it is more general and flexible than existing methods.

In order to evaluate the separation results accurately, we use the synthetic mixture audios for evaluation, similar to [30, 12]. The standard metrics of Signal-to-Distortion Ration (SDR), Signal-to-Interference Ratio (SIR), and Signal-to-Artifact Ratio (SAR) are used for evaluation. Our model is compared with the audio-only separation method of NMF-MFCC [28] and the audiovisual separation models of AV-Mix-Sep [12], Sound-of-Pixels [31] and Co-Separation [12]. Table 4 shows the separation results, where sound localization model is trained with solo videos. Although the previous methods use additional visual knowledge for guiding the sound separation, e.g., ImageNet-pretrained visual model in AV-Mix-Sep [31] and finetuned instrument detector in Co-Separation [12], our model still shows comparable results in both SDR and SIR⁴⁴4SAR measures the artifacts in the separated results instead of separation accuracy [12].. Note that, our results are achieved with fewer training samples compared with others, which demonstrates that our sound localization technique can provide effective visual representation of specific sound-maker. Moreover, our separation model based on masked visual features performs much better than the one with visual center. This is probably because the masked visual features could provide more detailed representation of sound-maker than the aggregated one. To validate the effectiveness of curriculum learning, the sound localization model is further trained with the duet videos. Table 5 shows the ablation results. The performance gain in SDR and SAR indicates that our audiovisual alignment model can utilize complex scenes to improve the ability of cross-modal perception further.

Methods	SDR $\uparrow$	SIR $\uparrow$	SAR $\uparrow$
NMF-MFCC[28]	0.92	5.68	6.84
AV-Mix-Sep[12]	3.16	6.74	8.89
Sound-of-Pixels[31]	7.30	11.90	11.90
Co-Separation[12]	7.38	13.7	10.80
Our-Model-Center	5.79	9.15	12.29
Our-Model-Mask	6.59	10.10	12.56

Table 4: Sound separation results on MIT-MUSIC-solo test set. All the methods are trained only with solo videos.

Methods	SDR $\uparrow$	SIR $\uparrow$	SAR $\uparrow$
Our-Model-solo	6.59	10.10	12.56
Our-Model-both	6.78	10.62	12.19

Table 5: Sound separation results on MIT-MUSIC-solo test dataset, where Our-Model-solo is trained only with solo videos, while Our-Model-both is trained with both solo and duet videos.

5 Conclusion

In this paper, we developed an audiovisual learning model that discovers, then aligns the sounds and sound-makers in arbitrary audiovisual scenes. A curriculum learning strategy is proposed to effectively train the model w.r.t. the number of sound-source. Further, we deployed the well-trained audiovisual model into practical perception tasks. We achieved noticeable audiovisual localization performance, and the localized object representation made a considerable boost to sound separation.

References

[1] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 609–617. IEEE, 2017.
[2] Relja Arandjelović and Andrew Zisserman. Objects that sound. arXiv preprint arXiv:1712.06651, 2017.
[3] Barry Arons. A review of the cocktail party effect. 1992.
[4] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pages 892–900, 2016.
[5] Christian Bauckhage. Lecture notes on data science: Soft k-means clustering. Technical report, Technical Report, Univ. Bonn, DOI: 10.13140/RG. 2.1. 3582.6643.
[6] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
[7] Stefan Braun, Daniel Neil, and Shih Chii Liu. A curriculum learning method for improved noise robustness in automatic speech recognition. In Signal Processing Conference, 2017.
[8] Ronan Collobert, Koray Kavukcuoglu, Jason Weston, Leon Bottou, Pavel Kuksa, and Michael Karlen. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(1):2493–2537, 2011.
[9] Jeffrey L. Elman. Learning and development in neural networks: the importance of starting small. Cognition, 48(1):71–99, 1993.
[10] Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T Freeman, and Michael Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619, 2018.
[11] Ruohan Gao, Rogerio Feris, and Kristen Grauman. Learning to separate object sounds by watching unlabeled video. arXiv preprint arXiv:1804.01665, 2018.
[12] Ruohan Gao and Kristen Grauman. Co-separating sounds of visual objects. arXiv preprint arXiv:1904.07750, 2019.
[13] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780. IEEE, 2017.
[14] C. Gong, D. Tao, S. J. Maybank, W. Liu, G. Kang, and J. Yang. Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing, 25(7):3249–3260, 2016.
[15] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. Cnn architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 131–135. IEEE, 2017.
[16] Nicholas P Holmes and Charles Spence. Multisensory integration: space, time and superadditivity. Current Biology, 15(18):R762–R764, 2005.
[17] Di Hu, Feiping Nie, and Xuelong Li. Deep multimodal clustering for unsupervised audiovisual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9248–9257, 2019.
[18] E Kidron, Y. Y Schechner, and M Elad. Pixels that sound. In IEEE Computer Society Conference on Computer Vision & Pattern Recognition, 2005.
[19] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2, 2015.
[20] Bruno Korbar, Du Tran, and Lorenzo Torresani. Co-training of audio and video representations from self-supervised temporal synchronization. arXiv preprint arXiv:1807.00230, 2018.
[21] Todd K Moon. The expectation-maximization algorithm. IEEE Signal processing magazine, 13(6):47–60, 1996.
[22] Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. Cassl: Curriculum accelerated self-supervised learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6453–6460. IEEE, 2018.
[23] Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. arXiv preprint arXiv:1804.03641, 2018.
[24] Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In European Conference on Computer Vision, pages 801–816. Springer, 2016.
[25] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[26] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. Learning to localize sound source in visual scenes. arXiv preprint arXiv:1803.03849, 2018.
[27] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[28] Martin Spiertz and Volker Gnann. Source-filter based clustering for monaural blind source separation. In Proceedings of the 12th International Conference on Digital Audio Effects, 2009.
[29] Xudong Xu, Bo Dai, and Dahua Lin. Recursive visual sound separation using minus-plus net. arXiv preprint arXiv:1908.11602, 2019.
[30] Hang Zhao, Chuang Gan, Wei-Chiu Ma, and Antonio Torralba. The sound of motions. arXiv preprint arXiv:1904.05979, 2019.
[31] Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. arXiv preprint arXiv:1804.03160, 2018.
[32] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.

6 Supplemental Materials

6.1 Curriculum Settings

The audio event dataset of AudioSet is annotated via a hierarchical ontology. For example, the sound of barking (the fourth-level label) is simultaneously annotated as Animal (the first-level label), Pets(the second-level label), and Dog(the third-level label), i.e., Animal $>$ Pets $>$ Dog $>$ Bark. Such hierarchical annotation makes it extremely difficult to precisely estimate the number of sound-sources in the clip. Considering that the third-level labels generally relate to the common objects in our surroundings, we propose to filter the original annotation by just keeping the third-level ones. This filtering process consists of two steps, i.e., removing the father annotations and reducing the children annotations. Concretely, for each third-level label of the video clip, we remove all its father annotations (the first and second label) if they appear in the label list of the same video clip. Meanwhile, for each fourth-level label, we reduce them into their father annotation of the corresponding third-level label then remove them. For example, we can find that the third-level label for Bark is Dog via the hierarchical ontology, then we use Dog to replace Bark. After finishing these two steps, the annotation list for each video clip should only contain the labels in the third-level. Finally, we can use the number of labels as the indicator of the number of sound-sources.

6.2 Audiovisual Learning Network

The audiovisual learning network is a two-stream network, which consists of a audio pathway and a visual pathway. We use the off-the-shelf VGGish network for the audio pathway, where we discard its last three fully-connected layers and last max-pooling layer. The output of the audio pathway is the feature maps of the final convolution layer of VGGish, with the size of $64\times 54\times 512$ . The similar procedure is performed with the visual pathway, but the model is replaced with VGG16, and the size of output visual feature maps is $16\times 16\times 512$ .

For the output features of both modalities, we use the Reshape operator to transform them into a set of vectors, i.e., reshaping from $64\times 54\times 512$ to $3456\times 512$ for audio and from $16\times 16\times 512$ to $256\times 512$ for visual modality. We use a fully-connected layer of $512-512$ to encode these reshaped features in the channel space. Then, the modality-specific clustering module is performed, with which we expect to discover concrete audio and visual contents. And the contrastive loss is used to train the whole network.

6.3 Poisson Regression Network

The Poisson regression network is developed based on the audio network of VGGish. We remove the last three fully-connected layers of VGGish but use GlobalMaxPooling over the output feature maps to obtain a $512D$ feature vector. Then, two fully-connected layers of $512-512-1$ are employed to project the feature vector into a predicted value for the Poisson average value of $\lambda$ . Finally, we can train the regression network w.r.t. the Poisson regression loss. The SGD optimizer is used, whose momentum is set to $0.9$ and the learning rate is initialized at $1e^{-2}$ and decays by $0.01/(1+epoch*0.5)$ .

6.4 Sound Separation Network

Our audiovisual sound separation network consists of two parts, one is for visual representation extraction of sound-maker and the other is for sound separation. Concretely, the visual branch is basically the audiovisual learning network. It takes image and corresponding sound as inputs, and extracts the visual representation of specific sound-maker in the scene. To automate this process, we use the audiovisual scenes with single sound-source. This is because we can directly localize the sound-maker by comparing different visual representations with the unique sound and without manual distinguishment. Then, we can use the corresponding clustering center as the representation of the localized sound-maker, which is a $512D$ vector.

The sound separation branch is a variant of U-Net[25], which consists of an encoder and a decoder. The encoder contains 6 convolution layers with $16-32-64-128-256-512$ channels, while the decoder contains 6 up-convolution layers with $512-256-128-64-32-16$ channels. The convolution layers in the encoder use $4\times 4$ filters, and followed by a BatchNormalization and a LeakyReLu (with a slope of $0.2$ ) layer. The up-convolution layers in the decoder also use $4\times 4$ filters, but followed by a BatchNormalization and a ReLu layer. The last up-convolution layer is followed by a sigmoid function to match value of the spectrogram mask. Similar to the original U-net, we apply skip-connection between symmetric encoder and decoder layers.

To integrate the above two branches, the $512D$ visual representation is first passed to a fully-connected layer of $512-512$ , then a BatchNormalization and a LeakyReLU (with a slope of $0.2$ ) layer. The resulted $512D$ visual vector is replicated to match the size of encoded audio feature maps. Then, the audio and visual feature maps are concatenated together, and fed to the decoder of the sound separation branch.

6.5 More results

6.5.1 Sound Separation

AudioSet-Instrument Following [11], all the clips in AudioSet[13] are filtered to construct a subset of 15 musical instruments. The filtered clips from the Unbalanced-Train set constitute the training dataset, the ones from Balanced-Train set are splitted for validation and testing. As some video clips in AudioSet have been removed by the uploader, the whole instrument dataset is smaller than the ones in [12], which is about 99,882/456/456 train/val/test clips. As 93,679 video clips in the training dataset have single sound-source, we directly use them for training the audiovisual learning model. Then, we use the well-trained model for extracting the visual representation of localized sound-source, with which we train the separation model, as shown in Fig. 7.

Table. 6 shows the sound separation results. We can find that our model shows superior performance over most compared methods, even some of them adopt additional visual knowledge. For example, AV-MIML [11] and Sound-of-Pixels [31] use ImageNet pre-trained model as the visual extractor. Besides, we also show the results of CAVL- $\mathcal{C}_{1}$ . Such results are obtained by directly viewing the audio assignment in the audiovisual learning model as the spectrogram mask. However, as the audio assignment is achieved over the embedded feature maps, it is too coarse to perform fine-grained spectrogram prediction. Hence, such method does not work for sound separation.

Methods	SDR	SIR	SAR
Sound-of-Pixels[31]	1.66	3.58	11.5
AV-MIML[11]	1.83	-	-
NMF-MFCC[28]	0.25	4.19	5.78
Co-Separation[12]	3.65	6.13	13.2
CAVL- $\mathcal{C}_{1}$	-4.54	0.23	2.28
Our-Model-Center	2.40	3.33	18.24
Our-Model-Mask	2.64	3.45	17.17

Table 6: Sound separation results on AudioSet-Instrument test set.