
Sound of Story: Toward Multimodal Storytelling with Audio

First Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain
Second Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain
Abstract

Storytelling is multi-modal in the real world. When telling a story, one may use visualizations and sounds along with the story itself. However, prior studies on storytelling datasets and tasks have paid little attention to sound, even though sound also conveys meaningful semantics of the story. Therefore, we propose to extend the story understanding and telling areas by establishing a new component called background sound, which is story context-based audio without any linguistic information. For this purpose, we introduce a new dataset, called Sound of Story (SoS), which has paired image and text sequences with corresponding sound or background music for a story. To the best of our knowledge, this is the largest well-curated dataset for storytelling with sound. Our SoS dataset consists of 27,354 stories with 19.6 images per story and 984 hours of speech-decoupled audio such as background music and other sounds. As benchmark tasks for storytelling with sound, we propose retrieval tasks between modalities and audio generation tasks from image-text sequences, introducing strong baselines for them. We believe the proposed dataset and tasks may shed light on the multi-modal understanding of storytelling in terms of sound. The dataset and baseline code for each task will be released at: https://github.com/Sosdatasets/SoS_Dataset.

1 Introduction

* These authors contributed equally to this work.

Story understanding and telling is one of the important but challenging problems in natural language processing and machine learning, and there have been extensive studies on this problem for decades harrison2017toward; kovcisky2018narrativeqa. Many prior works have focused on building models that understand stories, for example by utilizing pre-trained models chen2022pali; wang2023image; he2023vlab, event-centric understanding chen2021event, and multiway samplers xu2023multi. Many approaches have also been proposed that aim to understand stories from the provided text (mostafazadeh2016corpus; fan2018hierarchical; akoury2020storium), images huang2016visual; bensaid2021fairytailor; krojer2022image, and videos rohrbach2015dataset; xu2016msr; tapaswi2016movieqa; li2019video; bain2020condensed; huang2020movienet.

Figure 1: Overall data generating process flow for our Sound of Story (SoS) dataset. All the modalities are extracted from the same movie clips. We exclude the speech from audio to make it contain only pure audio data, which does not contain linguistic information.

Although there are many benchmarks for story understanding and telling, most of them pay little attention to audio, even though it is one of the key elements for humans to understand stories lang1998emotion; lake2017building. Therefore, we propose to extend the story understanding and telling areas by establishing a new component called background sound, which is story context-based audio without any linguistic information. For this purpose, we introduce a novel dataset called Sound of Story (SoS), which aims to leverage audio information for story understanding, distinguishing it from existing story or audio datasets.

To create the SoS dataset, all modalities are extracted from the Condensed Movie Dataset (CMD) (bain2020condensed) and the Large Scale Movie Description Challenge dataset (LSMDC) (rohrbach2015dataset). Image and text data extracted from the same movie clip share a context, and both can be used to infer the story of that clip. Audio such as background sound and music, in turn, can enrich storytelling by representing and conveying the context of a situation eronen2005audio; stowell2015detection. Due to these characteristics, the audio, images, and text extracted from the same movie clip are correlated and complementary. Thus, our SoS dataset provides the three modalities generated from the same movie clip as a pair for story understanding. In addition, we propose new story understanding benchmarks and their baseline models by introducing audio retrieval and audio generation tasks using the SoS dataset.

These background sound tasks can be utilized in various applications. For example, in movies and cartoons, background sound is often produced for a designated scene using various props or objects to create a sense of reality. For the retrieval task, among the available sound sources, it is possible to recommend the one that best fits a scene. Furthermore, background sound generation can produce more realistic audio while reducing human resource costs by directly generating the right audio for a given scene. The overall data generation process is shown in Figure 1.

In summary, our contributions are as follows:

  • We propose to extend the story understanding and telling areas by establishing a new component called background sound, which is story context-based audio without any linguistic information and has not been well explored. For this purpose, we introduce a new dataset named Sound of Story (SoS) consisting of text and image sequences and audio for a story. To the best of our knowledge, this is the largest well-curated dataset for storytelling with sound.

  • As benchmark tasks to show the significance and relevance of sound in a story, we introduce retrieval tasks between audio and other modalities and background sound generation tasks, along with baseline models for each of them.

  • We will release the dataset and the code used to generate it so that the community can produce more data for various tasks that utilize sound in storytelling and story understanding research.

2 Related Work

Figure 2: Example of SoS dataset. CMD and LSMDC have different lengths of descriptions.
Story understanding and telling

Storytelling tasks pose significant challenges in natural language processing (NLP), and researchers have proposed various datasets and benchmarks to tackle them. Wikiplots, WritingPrompts fan2018hierarchical, and ROCStories mostafazadeh2016corpus introduce datasets built around text-based story plots. Wikiplots comprises a collection of approximately 113k story plots extracted from English Wikipedia. WritingPrompts fan2018hierarchical assembled its dataset by gathering pairs of human-written stories and corresponding prompts from online sources. ROCStories mostafazadeh2016corpus employed Amazon Mechanical Turk (AMT) to generate short stories that incorporate common-sense knowledge, focusing on everyday topics. These datasets are usually used for story understanding or story generation tasks.

In addition to text-based story plots, there exist numerous story datasets that incorporate images or videos. Visual Storytelling huang2016visual introduces the first sequential vision-to-language dataset and explores its applicability in various tasks. It goes beyond simple object understanding or captioning for concrete scenes by considering input that encompasses temporal events. Similarly, ImageCoDe dataset krojer2022image also contains stories but is designed to enable models to learn detailed situations from finer visual differences within similar sets of images.

Movie data inherently encompasses storytelling. Various datasets such as MovieQA tapaswi2016movieqa and MovieNet huang2020movienet have been collected from movie data and used in various multimodal domains. Additionally, datasets like CMD bain2020condensed and LSMDC liu2019use not only provide movie videos but also include descriptions for the corresponding clips, offering researchers an opportunity to explore models from a multimodal perspective. However, there are few datasets for stories with sound. Existing audio benchmarks mostly consist of music bertin2011million; sturm2012analysis; ferraro2021melon or animal/environmental sounds salamon2014dataset; piczak2015esc; gemmeke2017audio; kim2019audiocaps, and are designed to evaluate simple matching performance between objects (or sentiment) and audio input.

Multi-modal retrieval tasks

Multi-modal retrieval models are trained to find similar features across different modalities by fusing them. These models must understand both the context of a given input and the interplay between the features provided by the other modalities. As a result, retrieval is a highly challenging and crucial task for comprehending the distinctive characteristics of multiple modalities.

The multi-modal retrieval task has been studied over various modality pairs such as image-text wang2020cross; zhang2020context; cheng2022cross; luo2022clip4clip; xuan2023dissecting, video-text ma2022x; gorti2022x; zhu2023complementarity, audio-image xu2020cross; yang2022multimodal; nakatsuka2023content, video-audio suris2018cross; gu2023dual; cheng2023ssvmr, and audio-text kim2022improving; xin2023improving. In particular, CLIP4CLIP luo2022clip4clip performs well on video-text retrieval by calculating similarities between the features of each modality obtained from its encoders, and X-CLIP ma2022x extends CLIP4CLIP with a multi-grained regulation function to improve performance. Furthermore, Wav2CLIP wu2022wav2clip performs audio-image retrieval based on the CLIP architecture radford2021learning, while CLAP elizalde2023clap focuses on audio-text retrieval. In this paper, we propose a story-based audio retrieval task for retrieving the corresponding features of other modalities, such as image sequences or descriptions.

| Dataset | Audio Type | # Story | # Images / Story | # Audio Hours | Avg. text length | Text Type |
| ImageCode | - | 9,402 | 17 | - | 23.3 | Captions |
| VIST | - | 50,136 | 4.2 | - | 10.2 | Description |
| Video Storytelling | open | 105 | - | 22 | 162.6 | Captions |
| MovieNet | open | 1,100 | - | 633 | 2004 | Description |
| MSR-VTT | open | 10,000 | - | 41.2 | 9.6 | Description |
| Ours | speech-decoupled audio | 27,354 | 19.6 | 984 | 88 | Description |
Table 1: Statistics compared to different storytelling datasets. In the audio types, "open" represents raw audio from the video, and "speech-decoupled audio" represents the speech-excluded audio. In the text types, "Caption" represents the text that explains only the image, and "Description" represents the text that includes the story.
Multi-modal audio generation

There have been prior studies on generating audio conditioned on other modalities vasquez2019melnet; yu2022museformer; borsos2022audiolm; neves2022generating; schneider2023mo; yang2023diffsound; agostinelli2023musiclm. Among audio generation tasks, text-to-audio generation has recently been actively studied. AUDIOGEN kreuk2022audiogen uses an augmentation technique that mixes different audio samples when generating audio from captions. Riffusion Forsgren_Martiros_2022 fine-tunes Stable Diffusion rombach2022high, a text-to-image model, to generate spectrogram images. MUSICGEN copet2023simple uses a single-stage language model with efficient token interleaving patterns and an unsupervised melody conditioning method. In this paper, we are interested in generating relevant audio given a story.

3 Dataset

We propose a novel dataset named the SoS dataset with 27,354 audio-image-text pairs; 22,330 pairs are created from CMD (bain2020condensed) and 5,024 from LSMDC (rohrbach2015dataset). We extract the images and audio from the same video and pair them with the corresponding description so that the images and text share the same storyline. The audio, being extracted from the same clip, is related to the other modalities and can therefore be used complementarily. Examples from the SoS dataset are shown in Figure 2.

3.1 Movie Clips

CMD (bain2020condensed) provides key clips and their high-level semantic descriptions from 3,605 movies. Each key clip has a duration of around 150 seconds and can be automatically obtained from YouTube, making it easily accessible. LSMDC (rohrbach2015dataset) consists of 118,081 short clips extracted from 202 movies. Each clip in LSMDC has a duration of around 3 to 12 seconds and is accompanied by captions generated from the Descriptive Video Service (DVS). In our work, we merge these two movie clip datasets to create the audio-image-text paired dataset. To bridge the gap between CMD and LSMDC, we adapt the LSMDC dataset to match the format of CMD. Since the short movie clips in LSMDC are contiguous, we concatenate them in sequential order until their combined length is similar to that of CMD; specifically, we concatenate around 20 short movie clips to create each raw movie clip.
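As a rough illustration of this adaptation step, the sketch below accumulates contiguous LSMDC clips until their combined duration approaches a CMD-style key clip and joins them with ffmpeg's concat demuxer. It assumes locally downloaded clip files listed in temporal order and ffmpeg/ffprobe on the path; it is a minimal sketch, not the released preprocessing code.

```python
import subprocess
from pathlib import Path

def clip_duration(path: str) -> float:
    """Length of a clip in seconds, obtained via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def build_raw_clip(lsmdc_clips: list[str], out_path: str,
                   target_s: float = 150.0) -> list[str]:
    """Concatenate contiguous LSMDC short clips (given in temporal order)
    until their combined length roughly matches a CMD key clip (~150 s)."""
    chosen, total = [], 0.0
    for clip in lsmdc_clips:
        chosen.append(clip)
        total += clip_duration(clip)
        if total >= target_s:
            break
    # ffmpeg concat demuxer expects a text file listing the input files
    list_file = Path(out_path).with_suffix(".txt")
    list_file.write_text("".join(f"file '{Path(c).resolve()}'\n" for c in chosen))
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", str(list_file), "-c", "copy", out_path], check=True)
    return chosen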

3.2 Audio processing

To facilitate audio processing, we first use ffmpeg tomar2006converting to extract audio from the video, ensuring that the length of the extracted audio matches that of the raw movie clip video. Subsequently, we employ the bytesep framework proposed by kong2021decoupling to decouple speech from the mixed audio containing various sounds. The resulting decoupled sound can be regarded as audio data that shares the storyline with the raw movie clip without any linguistic information. Speech decoupling is performed because speech itself carries the storyline, and a model could otherwise learn the storyline from the speech rather than from the non-speech audio, which is not our intention. The duration distribution of the extracted audio is shown in Figure 3.
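The extraction step can be sketched as follows; this is a minimal illustration assuming ffmpeg is installed, and the subsequent speech separation with bytesep is indicated only as a placeholder since its exact interface is not shown here.

```python
import subprocess

def extract_audio(video_path: str, wav_path: str, sample_rate: int = 44100) -> None:
    """Extract the full-length audio track from a movie clip with ffmpeg.

    -vn drops the video stream; the output length matches the clip length.
    """
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn",
         "-acodec", "pcm_s16le", "-ar", str(sample_rate), "-ac", "2", wav_path],
        check=True)

# Speech decoupling itself uses the bytesep source-separation framework
# (kong2021decoupling); its exact API is not reproduced here. Conceptually:
#   speech, background = separate(wav_path)   # hypothetical call
# and only the non-speech stem is kept as the SoS audio.
```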

Figure 3: Distributions of each modality: (a) the number of extracted images per movie video; (b) the duration of decoupled and processed audio for each story; (c) the number of words per story from CMD; (d) the number of words per story from LSMDC. Note that, due to video clip concatenation to match video lengths, descriptions for stories from LSMDC are longer. This text variation requires the model to have high-level text understanding.

3.3 Image processing

Since videos include many duplicated and similar images, only the key salient images that represent the storyline are extracted through image clustering. We conduct image clustering (https://github.com/elcorto/imagecluster) using image fingerprints, which are representations that uniquely identify an image based on its visual features. From all the image fingerprints extracted from frames of the movie clips, we gradually remove images whose fingerprint similarity is greater than 0.5, a threshold set empirically. Through this process, fewer frames are retained for static scenes and more frames for dynamic scenes. The extracted image sequences can represent the entire video by capturing salient scenes with significant variations, even though they do not include all of the frames. By focusing on scenes with meaningful changes, we obtain essential image sequences for each video. These processed image sequences form a novel dataset that highlights the storyline with fewer frames and differentiates it from traditional video datasets. Figure 3 shows the distribution of extracted frames for each movie clip.
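A minimal sketch of this key-frame selection is shown below. It assumes fingerprints have already been computed as per-frame embedding vectors (the referenced imagecluster package provides its own fingerprinting); the greedy comparison against the most recently kept frame is one plausible reading of the procedure, not the exact released code.

```python
import numpy as np

def select_key_frames(fingerprints: np.ndarray, threshold: float = 0.5) -> list[int]:
    """Greedy key-frame selection from per-frame fingerprints.

    `fingerprints` is an (n_frames, d) array of visual fingerprints (e.g. CNN
    embeddings). A frame is dropped when its cosine similarity to the most
    recently kept frame exceeds `threshold`, so static shots collapse to few
    frames while dynamic scenes keep more.
    """
    fp = fingerprints / np.linalg.norm(fingerprints, axis=1, keepdims=True)
    kept = [0]
    for i in range(1, len(fp)):
        if float(fp[i] @ fp[kept[-1]]) <= threshold:
            kept.append(i)
    return kept
```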

3.4 Text processing

There are slight differences in clip lengths between CMD and LSMDC, even though both provide short descriptions for each movie clip. Because we concatenate LSMDC movie clips to match the length of CMD movie clips, the corresponding descriptions are also concatenated in the same way. As a result, the concatenated descriptions exhibit variations in sentence length, as shown in Figure 3.

Furthermore, there are inherent differences in how the descriptions of the two datasets were generated: CMD uses descriptions written by users on YouTube, while LSMDC employs a descriptive video service to generate descriptions. This textual variation between the two datasets makes our task more challenging by requiring high-level text understanding. Figure 2 shows that the descriptions are well aligned with the other modalities.

Figure 4: Proposed retrieval model architecture. Encoded features are obtained from the encoder corresponding to each modality. Features are divided into context-level features and sequence-level features. Then we calculate the score for each task and use the scores to compute the final similarity $Sim$.

3.5 Data Statistics and Analysis

Our SoS dataset is derived from the 3,807 movies covered by the CMD and LSMDC datasets, resulting in a total of 27,354 clips. After processing, each clip has an average of 20 images, and the total number of images in our dataset is 535,707. The accumulated length of the extracted audio is around 984 hours, and each audio lasts around 150 seconds on average. Each clip's text description consists of 88 words on average. Split by source, CMD and LSMDC descriptions consist of an average of 83.2 and 1,378.7 words, with medians of 80 and 130.8, respectively. A quantitative comparison with other story-based datasets is shown in Table 1, which shows that our proposed SoS dataset is the largest well-curated storytelling dataset with extracted audio.

We generate paired decoupled audio data that can be used complementarily with the images and text in the SoS dataset for story understanding and telling. The extracted decoupled audio is in-the-wild audio and contains abundant information. Specifically, we remove the speech so that only pure audio consisting of background sound remains. The remaining audio contains a mixture of various daily-life sounds and background music. This distinctive characteristic sets it apart from other audio datasets, making it well-suited for story understanding.

4 Experiment

4.1 Retrieval Tasks

In this section, we introduce four retrieval tasks using our proposed dataset: audio-to-video, audio-to-text, text-to-audio, and video-to-audio retrieval. These tasks focus on finding the proper audio corresponding to an image sequence or a textual story, or matching an image sequence or text to given audio, which can be used, for example, to recommend background sound suitable for an image sequence or storyline.

| Eval | A2V Ours | A2V Wav2CLIP | V2A Ours | V2A Wav2CLIP | A2T Ours | A2T CLAP | T2A Ours | T2A CLAP | Random |
| R@1 (↑) | 7.615 | 4.231 | 6.462 | 5.154 | 7.462 | 6.538 | 7.077 | 5.769 | 0.076 |
| R@5 (↑) | 18.682 | 12.615 | 17.923 | 16.077 | 17.462 | 17.231 | 19.385 | 16.077 | 0.378 |
| R@10 (↑) | 28.000 | 19.538 | 27.000 | 23.462 | 24.692 | 24.615 | 25.692 | 23.538 | 0.750 |
| Median R (↓) | 40.615 | 31.231 | 38.923 | 33.692 | 33.000 | 34.615 | 32.769 | 34.385 | 631.800 |
| Mean R (↓) | 32.000 | 52.000 | 33.000 | 47.000 | 60.000 | 52.000 | 63.500 | 53.500 | 639.132 |
Table 2: Retrieval performance on the proposed SoS dataset, measured on 1,300 test samples. Random denotes the average of five random retrievals.

4.1.1 Retrieval Model

In this section, we introduce the model used for our proposed retrieval tasks. Our baseline model is inspired by the observation that the Attention Over Similarity Matrix method proposed in X-CLIP ma2022x can be flexibly applied to audio as well as video and text. The overall architecture is illustrated in Figure 4.

For descriptions, we use the CLIP text encoder $E_T$ radford2021learning. The information obtained from $E_T$ is defined as the sentence feature $f_s$ and the word features $f_w$. For video, the CLIP image encoder $E_I$ is used, and a temporal encoder $E_t$ is added to obtain temporal sequence features. The obtained features are defined as the frame features $f_f$, and the mean pooling of the frame features is defined as the video feature $f_v$. For audio, audio information is extracted by embedding log Mel-filterbank coefficients and feeding them into the audio encoder $E_A$, a pretrained Audio Spectrogram Transformer gong21b_interspeech. The first token of the resulting features is defined as the audio feature $f_a$, and the remaining tokens are defined as the audio sequence features $f_{as}$.

For audio-video retrieval, we first calculate the audio-video similarity $S_{a,v}$, the audio-frame similarity $S_{a,f}$, the audio sequence-video similarity $S_{as,v}$, and the audio sequence-frame similarity $S_{as,f}$ as follows:

\begin{align}
S_{a,v} &= (f_a)^{T} f_v \in \mathbb{R}^{1} \\
S_{a,f} &= (f_f\, f_a)^{T} \in \mathbb{R}^{1 \times m} \\
S_{as,v} &= f_{as}\, f_v \in \mathbb{R}^{n \times 1} \\
S_{as,f} &= f_{as}\,(f_f)^{T} \in \mathbb{R}^{n \times m}
\end{align}

where $m$ is the number of video frames and $n$ is the number of audio sequence tokens. Except for the audio-video similarity, an aggregation process is required to obtain an instance-level similarity. The audio-frame and audio sequence-video similarities are therefore aggregated as follows:

\begin{align}
S^{\prime}_{a,f} &= \sum_{i=1}^{m} \frac{\exp(S_{a,f}(1,i))}{\sum_{j=1}^{m}\exp(S_{a,f}(1,j))}\, S_{a,f}(1,i) \\
S^{\prime}_{as,v} &= \sum_{i=1}^{n} \frac{\exp(S_{as,v}(i,1))}{\sum_{j=1}^{n}\exp(S_{as,v}(j,1))}\, S_{as,v}(i,1)
\end{align}

The audio-level similarity $S^{\prime}_{aud}$ and the video-level similarity $S^{\prime}_{vid}$ are calculated to obtain the instance-level audio sequence-frame similarity:

\begin{align}
S_{aud} &= \sum_{i=1}^{n} \frac{\exp(S_{as,f}(i,*))}{\sum_{j=1}^{n}\exp(S_{as,f}(j,*))}\, S_{as,f}(i,*) \in \mathbb{R}^{1 \times m} \\
S^{\prime}_{aud} &= \sum_{i=1}^{m} \frac{\exp(S_{aud}(1,i))}{\sum_{j=1}^{m}\exp(S_{aud}(1,j))}\, S_{aud}(1,i) \\
S_{vid} &= \sum_{i=1}^{m} \frac{\exp(S_{as,f}(*,i))}{\sum_{j=1}^{m}\exp(S_{as,f}(*,j))}\, S_{as,f}(*,i) \in \mathbb{R}^{n \times 1} \\
S^{\prime}_{vid} &= \sum_{i=1}^{n} \frac{\exp(S_{vid}(i,1))}{\sum_{j=1}^{n}\exp(S_{vid}(j,1))}\, S_{vid}(i,1)
\end{align}

where $*$ denotes all elements in that dimension.

In addition, $S^{\prime}_{as,f}$ is obtained as the mean of $S^{\prime}_{aud}$ and $S^{\prime}_{vid}$ computed above. The similarity between video and audio is then obtained as the average of the four similarities, as shown in Equation 1.

Sim_{AV} = (S_{a,v} + S^{\prime}_{a,f} + S^{\prime}_{as,v} + S^{\prime}_{as,f})/4 \qquad (1)

For audio-text retrieval, the similarity $Sim_{AT}$ can be calculated analogously by replacing the video-related features with the text-related features.
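The similarity computation above can be summarized with the following un-batched PyTorch sketch, where the tensors follow the feature definitions above ($f_a$, $f_{as}$, $f_v$, $f_f$). The released implementation may differ in details such as batching and normalization.

```python
import torch
import torch.nn.functional as F

def attention_aggregate(s: torch.Tensor) -> torch.Tensor:
    """Softmax-weighted sum of a 1-D similarity vector (as in S'_{a,f}, S'_{as,v})."""
    return (F.softmax(s, dim=0) * s).sum()

def sim_av(f_a, f_as, f_v, f_f):
    """Instance-level audio-video similarity Sim_AV (Equation 1).

    f_a: (d,) audio feature, f_as: (n, d) audio sequence features,
    f_v: (d,) video feature, f_f: (m, d) frame features.
    """
    s_av = f_a @ f_v               # scalar audio-video similarity
    s_af = f_f @ f_a               # (m,)  audio-frame similarities
    s_asv = f_as @ f_v             # (n,)  audio sequence-video similarities
    s_asf = f_as @ f_f.T           # (n, m) audio sequence-frame similarities

    s_af_agg = attention_aggregate(s_af)
    s_asv_agg = attention_aggregate(s_asv)

    # audio level: softmax over the n audio tokens, then over the m frames
    s_aud = (F.softmax(s_asf, dim=0) * s_asf).sum(dim=0)   # (m,)
    s_aud_agg = attention_aggregate(s_aud)
    # video level: softmax over the m frames, then over the n audio tokens
    s_vid = (F.softmax(s_asf, dim=1) * s_asf).sum(dim=1)   # (n,)
    s_vid_agg = attention_aggregate(s_vid)
    s_asf_agg = (s_aud_agg + s_vid_agg) / 2                 # S'_{as,f}

    return (s_av + s_af_agg + s_asv_agg + s_asf_agg) / 4
```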

4.1.2 Implementation Details

Of the 27,354 stories in our SoS dataset, 26,054 are used for training and 1,300 for testing. For video, a total of 40 images are used as input: if there are fewer than 40 images, the sequence is padded to 40 by duplicate sampling in temporal order, and if there are more than 40, 40 images are obtained by uniform sampling. For audio, the entire audio is used for training with random cropping; at test time, the same information-rich region of each audio is cropped so that identical audio is used across all evaluations. For text, the description is tokenized with the CLIP tokenizer and the first 77 tokens are used, similar to other CLIP-based models luo2022clip4clip; ma2022x; li2023decap. The batch size is 128 for the audio-video and 256 for the audio-text retrieval task. We conducted the experiments on 2 NVIDIA A100 80GB GPUs using the PyTorch library for 100 epochs, which took about 20 hours.
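The frame sampling rule described above (duplicate sampling in temporal order when fewer than 40 frames are available, uniform sampling otherwise) can be sketched as follows; the exact sampling code may differ.

```python
import numpy as np

def sample_frames(frame_indices: list[int], num: int = 40) -> list[int]:
    """Pad short sequences by duplicate sampling in temporal order, or
    uniformly subsample long sequences, so every story yields `num` frames."""
    k = len(frame_indices)
    if k >= num:
        picks = np.linspace(0, k - 1, num=num).round().astype(int)
    else:
        picks = np.sort(np.resize(np.arange(k), num))  # time-ordered duplicates
    return [frame_indices[i] for i in picks]
```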

4.1.3 Results and Analysis

To evaluate the proposed baseline model on our SoS dataset, we compare it with Wav2CLIP wu2022wav2clip for A2V and V2A, and with CLAP elizalde2023clap for A2T and T2A. Both Wav2CLIP and CLAP are fine-tuned on our SoS dataset from their pretrained weights. For evaluation metrics, we use Recall@1 (R@1), Recall@5 (R@5), Recall@10 (R@10), Mean Rank (Mean R), and Median Rank (Median R). Table 2 shows the performance for audio-to-video (A2V), video-to-audio (V2A), audio-to-text (A2T), and text-to-audio (T2A) retrieval.
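For reference, these metrics can be computed from a query-by-candidate similarity matrix as in the following sketch, assuming the correct candidate for query i sits at column i.

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """R@1/5/10 (in %) and mean/median rank from a (queries x candidates)
    similarity matrix whose ground-truth match for query i is column i."""
    order = np.argsort(-sim, axis=1)                      # best candidate first
    ranks = np.array([np.where(order[i] == i)[0][0] + 1   # 1-indexed rank
                      for i in range(sim.shape[0])])
    return {
        "R@1": float((ranks <= 1).mean() * 100),
        "R@5": float((ranks <= 5).mean() * 100),
        "R@10": float((ranks <= 10).mean() * 100),
        "MedianR": float(np.median(ranks)),
        "MeanR": float(ranks.mean()),
    }
```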

For A2V and V2A, each story of the SoS dataset consists of a fairly long image sequence obtained by clustering frames from a video with an average length of 215 seconds, unlike other datasets. In addition, our audio includes diverse sounds to express a story and can thus represent storylines better than audio composed of simple sounds.

However, this also presents greater challenges: compared to retrieval using short videos and simple audio, it is a demanding task. Nevertheless, with R@1 of 7.615 and 6.462 for A2V and V2A, respectively, our baseline model outperforms the comparison model Wav2CLIP and shows reasonable results.

In addition, our text data consists of a storyline that implicitly represents the whole sequence rather than a description of a single scene, so the model must understand the context across scenes rather than an individual scene. With R@1 of 7.462 and 7.077 for A2T and T2A, respectively, our proposed baseline model again outperforms the comparison model CLAP. Retrieval examples can be found in our supplementary material and GitHub repository (https://github.com/Sosdatasets/SoS_Dataset).

4.2 Audio Generation Task

Composing and arranging background music and sound effects that exquisitely match the current scene's story is a highly challenging task that even human creators struggle with thom1999designing. We propose an audio generation model called SoSgen, which is designed to generate appropriate audio for a movie scene given the scene's descriptions, frames, or both as the condition.

Figure 5: SoSgen architecture for training. The components and arrows drawn with dashed lines are only used in the ControlNet-involved training, i.e., image-to-audio and image+text-to-audio. To train a text-to-audio model, we condition the denoising model with CLIP text embeddings of SoS descriptions. Once the text-to-audio model is sufficiently trained, we freeze the model and add image embeddings as a condition to train the text+image-to-audio model using ControlNet. Alternatively, an image-to-audio model can be trained from the Riffusion checkpoint using a redundant text condition "Audio" for each training example. $\mathcal{E}$ denotes the encoder, $\mathcal{D}$ the decoder, and $z_T$ the latent representation at timestep $T$.

4.2.1 Audio Generation Model

SoSgen takes advantage of the architecture of diffusion models sohl2015deep, which learn a data distribution through gradual denoising. Stable diffusion models rombach2022high, a variant of diffusion models, enable effective model conditioning by training the model on learned latent features. Following the same approach, we train SoSgen as a text-to-audio spectrogram diffusion model, using the following objective function:

\mathcal{L} = \mathbb{E}_{x,\epsilon\sim\mathcal{N}(0,1),t,c}\left[\|\epsilon-\epsilon_{\theta}(\tilde{z_{t}},t,c(d))\|^{2}_{2}\right] \qquad (2)

where $c$ is a trainable CLIP text encoder and $d$ is the SoS description.
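A single training step for this objective can be sketched as below, written against Hugging Face diffusers-style components (scheduler, UNet, CLIP text encoder); these interfaces are assumptions for illustration and not the authors' training script.

```python
import torch
import torch.nn.functional as F

def denoising_loss(unet, text_encoder, vae_encode, scheduler, spectrograms, tokens):
    """One text-to-audio training step for the objective in Eq. (2).

    spectrograms: (B, C, H, W) spectrogram images of sliced SoS audio.
    tokens: tokenized SoS descriptions d; text_encoder plays the role of c(.).
    vae_encode maps spectrogram images to latents z; scheduler adds noise.
    """
    z = vae_encode(spectrograms)                                   # latents z
    noise = torch.randn_like(z)                                    # eps ~ N(0, 1)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (z.shape[0],), device=z.device)              # timestep t
    z_t = scheduler.add_noise(z, noise, t)                         # noisy latent z_t
    cond = text_encoder(tokens)[0]                                 # c(d)
    eps_pred = unet(z_t, t, encoder_hidden_states=cond).sample     # eps_theta(...)
    return F.mse_loss(eps_pred, noise)                             # ||eps - eps_theta||^2
```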

For image+text-to-audio training, we follow the ControlNet zhang2023adding architecture, a lightweight and powerful method that enables additional conditioning of a diffusion model while freezing the original model's parameters. We freeze the text-to-audio model's parameters and resume training with a similar objective function:

\mathcal{L} = \mathbb{E}_{x,\epsilon\sim\mathcal{N}(0,1),t,c,c_{img}}\left[\|\epsilon-\epsilon_{\theta}(\tilde{z_{t}},t,c(d),c_{img}(f))\|^{2}_{2}\right] \qquad (3)

where $c_{img}$ is an additional CLIP image encoder that embeds the SoS frame sequence $f$.

4.2.2 Implementation Details

Audio in movies rarely remains consistent until the end of a single scene. Also, using long audio as input or output for a deep learning model is burdensome in terms of hardware capacity. Thus, we slice the audio data into segments of approximately 15 seconds. Since the extracted frames partially reflect semantic changes within a scene, slicing treats the audio between two extracted frames as a base unit. The unit audios are accumulated and concatenated so that each sliced audio is around 15 seconds and generally not longer. Unit audios that are themselves longer than 15 seconds are kept as a single slice, since the absence of significant visual change over that duration suggests long, consistent audio. Sliced audio files containing only little sound are eliminated using the PyDub library (https://github.com/jiaaro/pydub). To this end, we assess the mean frequency of each segment and exclude any with a frequency below 78 Hz, a threshold determined through manual inspection.

This process yielded 185,341 audio segments of roughly 15 seconds each, which were then converted into spectrograms. We randomly split them into 184,462 training and 879 test samples.
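The slicing and filtering procedure can be sketched as follows. The accumulation rule and the mean-frequency check (computed here as a spectral centroid with NumPy) are our reading of the description above, not the exact released pipeline.

```python
import numpy as np
from pydub import AudioSegment

def slice_story_audio(wav_path, frame_times_s, max_len_s=15.0, min_mean_hz=78.0):
    """Slice a story's audio into ~15 s chunks, using key-frame timestamps as
    unit boundaries, then drop near-silent chunks via a crude frequency check."""
    audio = AudioSegment.from_file(wav_path)
    bounds = [0.0] + sorted(frame_times_s) + [len(audio) / 1000.0]

    chunks, cur = [], AudioSegment.empty()
    for start, end in zip(bounds[:-1], bounds[1:]):
        unit = audio[int(start * 1000):int(end * 1000)]
        # close the current chunk if adding this unit would push it past ~15 s
        if len(cur) > 0 and (len(cur) + len(unit)) / 1000.0 > max_len_s:
            chunks.append(cur)
            cur = AudioSegment.empty()
        cur += unit  # a single unit longer than 15 s stays whole
    if len(cur) > 0:
        chunks.append(cur)

    def mean_frequency(seg: AudioSegment) -> float:
        # spectral centroid as a stand-in for the paper's "mean frequency"
        x = np.array(seg.set_channels(1).get_array_of_samples(), dtype=np.float32)
        if len(x) == 0:
            return 0.0
        spectrum = np.abs(np.fft.rfft(x))
        freqs = np.fft.rfftfreq(len(x), d=1.0 / seg.frame_rate)
        return float((freqs * spectrum).sum() / (spectrum.sum() + 1e-8))

    return [c for c in chunks if mean_frequency(c) >= min_mean_hz]
```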

For our base model, we leveraged Riffusion Forsgren_Martiros_2022, a checkpoint of Stable Diffusion v1-5 rombach2022high fine-tuned on pairs of music descriptions and spectrograms. Following the same approach, we fine-tuned the diffusion model from the Riffusion checkpoint on pairs of SoS descriptions and the corresponding sliced audio spectrograms. In addition, we further trained the model with image and text conditioning using the ControlNet zhang2023adding architecture, where the image condition is the sequence of clustered frames corresponding to the sliced audio. SoSgen_text+image was additionally trained from the SoSgen_text checkpoint, while SoSgen_image resumed from the Riffusion checkpoint and used a redundant text condition in both training and testing. The specific model architecture is shown in Figure 5.

For training our models, we used a batch size of 1, 16 gradient accumulation steps, an initial learning rate of 1e-6, and a cosine learning rate scheduler. We fine-tuned the text-to-audio model for 406,000 steps and trained for an additional 525,000 steps for the text+image-to-audio model. Separately, the image-to-audio model was trained for 525,000 steps. Fine-tuning took 2 days and ControlNet training another 2 days on a single NVIDIA GeForce RTX 3090 GPU.

4.2.3 Results and Analysis

Baselines

We compare SoSgen to two baselines: Riffusion Forsgren_Martiros_2022 and MUSICGEN copet2023simple, the state-of-the-art open-source text-to-music generation model. The performance of the models is evaluated with the Fréchet Audio Distance (FAD) kilgour2019frechet, a reference-free evaluation metric computed from the means and covariances of VGGish hershey2017cnn embeddings. We used a lightweight PyTorch implementation of FAD (https://github.com/Sosdatasets/SoS_Dataset) for more efficient evaluation.
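FAD itself is the Fréchet distance between Gaussians fitted to VGGish embeddings of the reference and generated audio; a minimal NumPy/SciPy sketch is shown below (embedding extraction is assumed to be done separately).

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to VGGish embeddings (N, 128)."""
    mu1, mu2 = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    sigma1 = np.cov(emb_ref, rowvar=False)
    sigma2 = np.cov(emb_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)          # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                       # drop numerical imaginary parts
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * np.trace(covmean))
```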

Table 3: FAD (lower is better) of SoSgen and the baselines. SoSgen_text+image is the only model given both the image and text conditions; the rest are given only SoS descriptions as the condition.
| Model | FAD_VGGish (↓) |
| Riffusion | 20.114 |
| MUSICGEN | 11.680 |
| SoSgen_text | 10.935 |
| SoSgen_image | 18.324 |
| SoSgen_text+image | 9.099 |

Table 3 compares SoSgen against Riffusion and MUSICGEN. SoSgen_text, Riffusion, and MUSICGEN generate an audio spectrogram given descriptions; for SoSgen_image and SoSgen_text+image, an image sequence is also given as an additional condition.

The results indicate that SoSgen_text and SoSgen_text+image outperform Riffusion and MUSICGEN in terms of FAD. Although SoSgen_image shows only a slight improvement over Riffusion, combining the two modalities markedly enhances audio generation quality. The generated audio samples can be listened to at our GitHub link (https://github.com/Sosdatasets/SoS_Dataset) and are also included in the supplementary material. The samples demonstrate SoSgen's strong performance on story audio generation.

Limitations

To use audio data in story understanding, some limitations remain to be solved. First, we create the audio data by decoupling speech to exclude linguistic information and preserve only pure audio. However, the resulting audio may still be distant from what people encounter in daily life, because noise introduced when extracting audio from the movie clips may remain. Second, the audio-image-text pairs are strongly connected but not perfect, due to the nature of audio: since audio involves many subjective elements, multiple correct answers may exist for one story. To address these limitations, stricter and more careful policies for processing not only audio but also images and text for story understanding should be studied as future work.
