Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training
Abstract.
Recent advances have been witnessed in audio-language joint learning, such as CLAP, which shows much success in multi-modal understanding tasks. These models usually aggregate uni-modal local representations, namely frame or word features, into global ones, on which the contrastive loss is employed to reach coarse-grained cross-modal alignment. However, frame-level correspondence with texts may be ignored, which hurts explainability and fine-grained understanding and may also undermine performance on coarse-grained tasks. In this work, we aim to improve both coarse- and fine-grained audio-language alignment in large-scale contrastive pre-training. To unify the granularity and latent distribution of the two modalities, a shared codebook is adopted to represent multi-modal global features with common bases, and each codeword is regularized to encode modality-shared semantics, bridging the gap between frame and word features. Based on it, a locality-aware block is introduced to purify local patterns, and a hard-negative guided loss is devised to boost alignment. Experiments on eleven zero-shot coarse- and fine-grained tasks suggest that our model not only surpasses the baseline CLAP significantly but also yields superior or competitive results compared to current SOTA works.
1. Introduction
With the advance of learning theories and data collection (Gemmeke et al., 2017), large-scale pre-trained models, such as PANNs (Kong et al., 2020) and AST (Gong et al., 2022), have achieved extraordinary results on sound-related challenges, such as sound classification (Salamon and Bello, 2017) and sound event detection (Li et al., 2024c). Despite such success, these methods require downstream tuning to adapt to novel scenarios and cannot facilitate tasks related to natural language, e.g., retrieving or generating audio clips (Liu et al., 2023; Xin et al., 2023; Xie et al., 2023) according to human instructions. Alternatively, Contrastive Language-Audio Pre-training (CLAP) (Elizalde et al., 2023) is introduced to learn general and transferable representations by associating audios with corresponding captions. Consequently, an aligned feature space is built, making it versatile for several tasks, such as zero-shot audio tagging and retrieval, by simply computing the cosine similarity between encoded audio features and textual features of sound classes (Li et al., 2024b).
However, during empirical practice, we notice that current CLAP models lack the capability of capturing fine-grained alignment, such as the relationship between acoustic events and textual meanings. An example of this phenomenon is depicted in Figure 1 (a). As seen, although the two kinds of events, namely alarm and speech, are successfully recognized by the original CLAP, the similarity between frame representations and textual sound representations is largely inconsistent with the real temporal locations of the sound events. For instance, the sound "alarm" occurs at 2.8s-4.5s and 5.4s-7.0s, but the corresponding frame-level similarity is high over the whole clip. This may undermine the model's explainability and lead to undesirable results on fine-grained cross-modal understanding tasks, including zero-shot sound event detection and text-to-audio grounding (Xu et al., 2021). Moreover, poor performance can also be observed in certain cases when conducting coarse-grained tasks like zero-shot audio tagging and retrieval, since local patterns and temporal information are potentially ignored by the vanilla CLAP paradigm. We attribute the above problem to the lack of interaction between frame and word features during CLAP training, as current CLAP methods reach cross-modal alignment solely via the similarity of the global features of each modality.

To mitigate this research gap, we propose to adopt a modality-shared codebook to encourage the multi-modal features to interact at a finer granularity. The codebook consists of several learnable codewords, and a weighted summation of them is utilized to represent the global features of each modality, so that the two modalities are naturally restricted to the same feature space, making it easier to learn the alignment. To encode cross-modal shared semantic concepts (e.g., sound events) into each codeword, we further revise the traditional working scheme of the codebook for computing the aggregation weights. Practically, we define the affinity score between a clip (or a caption) and each codeword as the maximum cosine similarity between its frame features (or word features) and the specific codeword. Then, the global feature can be represented with a small number of codewords by applying sparse constraints on the affinity scores to avoid noisy activation before using them as aggregation weights. Through optimizing the contrastive loss, not only can the paired global features be well-aligned, but the frame features of an acoustic event (e.g., "alarm") and the word features of the corresponding caption also activate the same, small set of codewords, thereby implicitly building a connection between fine-level multi-modal features. Moreover, we notice that local acoustic patterns may be destroyed by the vanilla transformer block and devise a novel locality-aware block to ensure high-quality frame features for codeword aggregation. Finally, a hard-negative guided contrastive loss is reformulated to mine more discriminative representations in order to build a better-aligned global latent space. Equipped with these techniques, our MGA-CLAP reaches a better fine-grained alignment than the original CLAP without losing its natural coarse-grained alignment, as shown in Figure 1 (b).
We conduct extensive experiments on both coarse- and fine-grained audio-text tasks. On the fine-grained ones, MGA-CLAP surpasses the original CLAP by a large margin. Specifically, using WavCaps (Mei et al., 2023) as the main pre-training dataset, MGA-CLAP achieves 26.4%/10.1% PSDS1 on zero-shot DESED (Serizel et al., 2020)/AudioSet-Strong (Hershey et al., 2021) sound event detection tasks, which is 13.3%/6.7% higher than its baseline CLAP. For coarse-grained retrieval and tagging tasks, our method also demonstrates noticeable improvements over CLAP and shows better performance on most evaluation protocols compared to previous SOTA works, which generally require much more training resources. Besides, several ablation studies are performed to reveal the effect of each component. Finally, we visualize the semantic meanings of specific codewords to show their roles in linking different modalities.
2. Related Work
2.1. Contrastive Language-Audio Pre-training
By pre-training on 400M image-text pairs, CLIP (Radford et al., 2021) demonstrates superior transferability on cross-modal vision problems, such as zero-shot image retrieval and classification. Several works, including AudioCLIP (Guzhov et al., 2022) and Wav2CLIP (Wu et al., 2022), try to leverage visual modality as a bridge to connect text and audio representations, achieving promising results on zero-shot audio tagging tasks. With the collection of large-scale audio caption datasets, namely AudioCaps (Kim et al., 2019), Clotho (Drossos et al., 2020) and WavText5K (Deshmukh et al., 2022), a lot of works explore contrastive language-audio pre-training without involving the visual modality. MS-CLAP (Elizalde et al., 2023) first obtains aligned text and audio encoders on a combination of off-the-shelf audio-text datasets. However, due to the limits of data size, its performance is sub-optimal. A few researchers then turn to expand the scale of audio-text datasets. LAION-Audio-630K (Wu et al., 2023a) and WavCaps (Mei et al., 2023), collected and annotated by human professionals and ChatGPT respectively, are shown to be more effective for pre-training. Besides, BLAT (Xu et al., 2023b) proposes to utilize a well-trained model together with audio tags to automatically generate captions for pre-training while Cacophony (Zhu and Duan, 2024) combines an audio caption model and Large Language Models to expand the data size to 4M and explore training strategies on such large-scale dataset. Moreover, the intrinsic shortcomings of CLAP are also studied. ACBA (Wu et al., 2023c) and CompA (Ghosh et al., 2023) enhance CLAP’s compositional reasoning ability while FLAP (Yeh et al., 2023) devises masking strategies to improve both the training efficiency and model performance. By contrast, we notice the unsatisfactory fine-grained alignment of CLAP and aim to discover both fine- and coarse-grained correspondence solely from audio-text pairs.

2.2. Audio Feature Learning with Codebook
The codebook is the key design in vector quantization (Van Den Oord et al., 2017), which is widely adopted for both understanding (Bao et al., 2021) and generation (Razavi et al., 2019) tasks. During quantization, encoder features are substituted by their nearest-neighbor codewords in the codebook before being utilized by the decoder to reconstruct the original features. Thus, by querying the learned codebook, a continuous space can be transformed into finite discrete tokens. In this way, modern neural audio codec models (Zeghidour et al., 2021; Wu et al., 2023b) learn to convert the raw waveform into several codewords, paving the way for efficient audio compression (Défossez et al., 2022) and auto-regressive audio generation (Wang et al., 2023a). Besides, BEATs (Chen et al., 2022b), a state-of-the-art self-supervised learning approach, also employs an acoustic tokenizer to quantize spectrograms into codewords for mask prediction, which demonstrates better performance compared to reconstruction-based methods such as AudioMAE (Huang et al., 2022). Different from the above, we leverage the codebook to accommodate both text and audio hidden representations instead of single-modality raw signals, which explicitly constructs a shared multi-modal feature space for coarse-grained alignment. Moreover, we dedicatedly redesign the computational rules of the codebook so that it can help discover the fine-grained correspondence.
2.3. Learning Frame-level Correspondence from Weak or Caption Supervision
Frame-wise labeling is extremely laborious for audio tasks, hence learning from partially labeled data (e.g., weak labels or audio captions) becomes a promising remedy. Weakly supervised sound event detection (Kumar and Raj, 2016; Lin et al., 2020) aims to recognize the sound event boundary under weak supervision, where only the clip-level annotations are provided but the exact timestamps are inaccessible. However, it solely maps acoustic features to a closed label set, which limits its applications in open-world scenarios. By contrast, learning from audio captions addresses the aforementioned problem by associating frame features with general language descriptions. But it is more challenging due to the intrinsic modality gap. Besides, the noisy information (non-sound words) contained in the captions also increases the difficulty. UACA (Xie et al., 2022) first learns relationships between sound events and textual phrases from audio captions by aggregating frame-word similarity matrix to clip-caption similarity, while WSTAG (Xu et al., 2023a) improves it by leveraging max-mean instead of mean-mean pooling. However, these works depend on exhaustive score matching while ignoring complex frame-word interaction, leading to suboptimal fine-grained alignment when scaling to a much larger pre-training dataset. In this work, we propose a novel solution to model the frame-word correspondence, demonstrating better performance and scalability than (Xie et al., 2022; Xu et al., 2023a).
3. Methodology
3.1. Overview
An overview of our MGA-CLAP is shown in Figure 2 (a). As illustrated above, we introduce a novel modality-shared codebook, which aggregates frame- and word-level features with shared codewords. Then, in order to refine the frame-wise features, a locality-aware block is introduced to better capture local patterns. Finally, the CLAP loss is reformulated to emphasize indistinguishable audio-text pairs for contrastive optimization. In the following subsections, we detail these three core designs.
3.2. Modality-shared Codebook
3.2.1. Multi-modal Representations in CLAP
CLAP employs a bi-encoder architecture to learn the aligned feature space for both modalities. Specifically, assume that we have a batch of audio-text pairs $\{(a_i, t_i)\}_{i=1}^{B}$, where $a_i$ and $t_i$ represent the $i$-th audio clip and its caption, and $B$ is the batch size. The CLAP audio encoder takes $a_i$ as input and generates frame representations $F_i \in \mathbb{R}^{N_f \times D}$, while the text encoder outputs word-level features $W_i \in \mathbb{R}^{N_w \times D}$ according to $t_i$, where $N_f$, $N_w$ and $D$ are the number of frames, words and feature dimensions, respectively. Then, to obtain the global clip- and caption-level features, an aggregator $g_a$ is required to map $F_i$ to $\bar{a}_i$, and $g_t$ works similarly to aggregate $W_i$ to $\bar{t}_i$. Finally, the symmetric contrastive loss is optimized to pull together the global features of paired audios and texts while pushing away unpaired ones in the latent space,

(1) $\mathcal{L} = -\dfrac{1}{2B}\sum_{i=1}^{B}\left[\log\dfrac{\exp(\langle \bar{a}_i, \bar{t}_i\rangle/\tau)}{\sum_{j=1}^{B}\exp(\langle \bar{a}_i, \bar{t}_j\rangle/\tau)} + \log\dfrac{\exp(\langle \bar{a}_i, \bar{t}_i\rangle/\tau)}{\sum_{j=1}^{B}\exp(\langle \bar{a}_j, \bar{t}_i\rangle/\tau)}\right]$

where $\langle\cdot,\cdot\rangle$ is the inner product function and $\tau$ is a scaling factor.
In the above CLAP paradigm, $g_a$ and $g_t$ are instantiated by mean pooling or attention pooling, suggesting that the global features are essentially weighted sums of two bases: the audio frames and the language tokens. However, due to the modality gap, the two bases may exhibit different granularities and semantics and thus be distributed in distinct hidden spaces, making it challenging to learn the coarse-grained alignment. Moreover, the frame and word representations are separately aggregated into the global features without additional interaction, which may increase the difficulty of discovering more granular correspondence (e.g., frame-to-word or frame-to-phrase alignment), since only coarse-level supervision is accessible in audio-text pairs.
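For reference, a minimal PyTorch sketch of this symmetric contrastive objective is given below; it is an illustrative re-implementation of Equation (1), not the released CLAP code, and assumes L2-normalized global features so that the inner product acts as a cosine similarity.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_global: torch.Tensor,
                          text_global: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss of Eq. (1) over a batch of paired global features.

    audio_global, text_global: (B, D) aggregated clip-/caption-level features.
    tau: the scaling factor (temperature).
    """
    a = F.normalize(audio_global, dim=-1)
    t = F.normalize(text_global, dim=-1)
    logits = a @ t.T / tau                              # (B, B) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    loss_a2t = F.cross_entropy(logits, targets)         # audio-to-text direction
    loss_t2a = F.cross_entropy(logits.T, targets)       # text-to-audio direction
    return 0.5 * (loss_a2t + loss_t2a)
```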
3.2.2. Modality-shared Codebook as the Aggregator
To seek a common multi-modal hidden space, we introduce a novel modality-shared codebook as the feature aggregator. By this means, the global audio feature and text feature are represented with the same set of learnable codewords as $\bar{a}_i = \sum_{k=1}^{K} \alpha^{a}_{i,k}\, c_k$ and $\bar{t}_i = \sum_{k=1}^{K} \alpha^{t}_{i,k}\, c_k$, where $\{c_k\}_{k=1}^{K}$ are the mentioned codewords, and $\alpha^{a}_{i,k}$ and $\alpha^{t}_{i,k}$ are the corresponding aggregation weights of $c_k$ for clip $a_i$ and caption $t_i$, respectively. To capture rich local semantics during aggregation, we specially devise the pipeline to calculate $\alpha^{a}_{i,k}$ and $\alpha^{t}_{i,k}$ as shown in Figure 2 (b). Mathematically, given the extracted frame-wise features $F_i$ of clip $a_i$, we define the affinity score between $F_i$ and $c_k$ as,

(2) $s_{i,k} = \dfrac{1}{\sigma}\max_{1\le n\le N_f}\dfrac{\langle f_{i,n},\, c_k\rangle}{\lVert f_{i,n}\rVert\,\lVert c_k\rVert}$

where $f_{i,n}$ is the $n$-th frame feature of $F_i$ and $\sigma$ is a scaling term. Notably, adopting max pooling instead of mean pooling may uncover momentary sounds even if they only last one frame, thereby guaranteeing semantic integrity during aggregation.
The affinity scores are then normalized by the Sparsemax (Martins and Astudillo, 2016) function, which works similarly to Softmax but encourages most of the elements in the resulting probability distribution to be 0,

(3) $[\alpha^{a}_{i,1}, \ldots, \alpha^{a}_{i,K}] = \mathrm{Sparsemax}([s_{i,1}, \ldots, s_{i,K}])$

With the sparse constraints, $\bar{a}_i$ can be represented by only a few codewords, which helps eliminate noisy activations and enhances interpretability. The global text feature $\bar{t}_i$ is constructed in the same way from the word-level features $W_i$.
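The whole aggregation can be sketched in a few lines of PyTorch. The sparsemax below follows the simplex projection of Martins and Astudillo (2016); treating the reported values 0.15 and 4096 as the scaling term $\sigma$ and the codebook size $K$ is an assumption, and the module and variable names are illustrative rather than taken from the released implementation. The same module is applied to frame features and word features so that both global features lie in the span of the shared codewords.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sparsemax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Sparsemax (Martins & Astudillo, 2016): Euclidean projection onto the simplex."""
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    z_cumsum = z_sorted.cumsum(dim=dim)
    k = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype)
    view = [1] * z.dim()
    view[dim] = -1
    k = k.view(view)
    support = (1 + k * z_sorted) > z_cumsum            # entries kept in the sparse output
    k_support = support.sum(dim=dim, keepdim=True).clamp(min=1)
    tau = (z_cumsum.gather(dim, k_support - 1) - 1) / k_support.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)

class SharedCodebookAggregator(nn.Module):
    """Aggregates frame- or word-level features into a global feature via shared codewords."""

    def __init__(self, num_codewords: int = 4096, dim: int = 512, sigma: float = 0.15):
        super().__init__()
        self.codewords = nn.Parameter(torch.randn(num_codewords, dim) * 0.02)  # {c_k}
        self.sigma = sigma  # scaling term in Eq. (2) (assumed mapping of the reported 0.15)

    def forward(self, local_feats: torch.Tensor) -> torch.Tensor:
        # local_feats: (B, N, D) frame-wise or word-wise features
        f = F.normalize(local_feats, dim=-1)
        c = F.normalize(self.codewords, dim=-1)
        cos = torch.einsum("bnd,kd->bnk", f, c)          # per-frame cosine with each codeword
        affinity = cos.max(dim=1).values / self.sigma    # Eq. (2): max-pool over frames
        weights = sparsemax(affinity, dim=-1)            # Eq. (3): sparse aggregation weights
        return weights @ self.codewords                  # (B, D) global feature
```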
Finally, we provide an intuitive view of how the proposed paradigm reaches fine-grained cross-modal alignment. Under the supervision of the contrastive loss, the similarity of paired samples is supposed to be maximized. However, due to the sparse regularization, the model may have to resort to the same, small set of codewords to represent the audio and the text in order to increase $\langle \bar{a}_i, \bar{t}_i\rangle$. Let $c_k$ be one of the activated codewords. It then acts as a prior target, which requires the encoders to refine the frame (or word) representations so as to maximize both the affinity between $c_k$ and the frame features and that between $c_k$ and the word features. As a result, the corresponding frame and word features are attracted to the same anchor, which contains semantic information of specific sound classes, thereby bridging the gap between multi-modal local features.
3.3. Locality-aware Encoder Block
Obtaining meaningful local representations is crucial; otherwise, some codewords may be activated by mistake during feature aggregation. Recall that in the vanilla CLAP audio encoder, the outputs of the last transformer encoder block are decoupled to produce the final frame-wise features. Its general architecture can be found in the upper part of Figure 2 (c), which first employs self-attention to consider global contexts. Specifically, let $X \in \mathbb{R}^{N_f \times D}$ be the input sequence of the block; the query, key and value matrices are first calculated by separate linear projections $W_q$, $W_k$, $W_v$,

(4) $Q = XW_q,\quad K = XW_k,\quad V = XW_v$

Then, to compute the output feature $o_n$ for each frame $n$, the q-k attention is applied as follows,

(5) $w_{n,m} = \dfrac{\exp(\langle q_n, k_m\rangle/\sqrt{D})}{\sum_{m'=1}^{N_f}\exp(\langle q_n, k_{m'}\rangle/\sqrt{D})}$

(6) $o_n = \sum_{m=1}^{N_f} w_{n,m}\, v_m$

where $q_n$, $k_m$ and $v_m$ are the $n$-th and $m$-th vectors of $Q$, $K$ and $V$. In this way, information from other frames can be injected into the current frame by referring to the q-k similarity.

However, according to the mechanism of self-attention (Vaswani et al., 2017), we argue that the value vector $v_n$ computed at each location already captures rich local semantics. By contrast, obtaining a comprehensive view by attention aggregation may contaminate the local patterns, which can be detrimental to fine-grained alignment. Figure 3 gives two examples to support our hypothesis. As seen, the v-v similarities (computed by substituting $\langle v_n, v_m\rangle$ for $\langle q_n, k_m\rangle$ in Equation (5)) of the last block are high within the same sound event but low across different events, meaning that acoustically dissimilar frames exhibit distinct value vectors suitable for finer-level discrimination. In comparison, the q-k similarities show inconsistency with the event boundaries.
Inspired by the above, we design the locality-aware block as shown in Figure 2 (c), which simply removes the q-k attention and directly leverages the projected value matrix $V$ as the output feature sequence, with the other components unchanged. Besides, we only replace the last block of the audio branch, for the following reasons: (1) transformer receptive fields become much more global midway through the network, as (Raghu et al., 2021) suggest; (2) since the audio encoder is pre-trained on AudioSet to learn general patterns (a widely-adopted setting of previous CLAP variants), replacing only the last block retains most prior knowledge; (3) it strikes a balance between local sensitivity and global context, as shown in the following ablations.
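For concreteness, a minimal sketch of such a block is shown below under an assumed pre-norm transformer layout: the q-k attention is dropped and each frame only passes through the retained value projection and the feed-forward network. Dimensions and normalization placement are assumptions, not details from the released code.

```python
import torch
import torch.nn as nn

class LocalityAwareBlock(nn.Module):
    """Transformer-style block without q-k attention: no mixing across frames,
    so local acoustic patterns are preserved in the output frame features."""

    def __init__(self, dim: int = 768, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.value = nn.Linear(dim, dim)   # the value projection W_v kept from Eq. (4)
        self.proj = nn.Linear(dim, dim)    # output projection of the original attention layer
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) frame sequence; residual connections as in a standard block
        x = x + self.proj(self.value(self.norm1(x)))
        x = x + self.mlp(self.norm2(x))
        return x
```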
3.4. Hard Negative Guided Contrastive Loss
Contrastive learning can benefit a lot from in-batch hard negative samples (Robinson et al., 2020). For vision-language tasks, several works (Wang et al., 2023b; Li et al., 2021; Yan et al., 2022) tend to resample or manually craft hard negative instances to improve alignment, which generally involves more training costs. In this work, we devise a simple re-weighting approach to force the model to pay more attention to hard negative samples during optimization. The loss function is reformulated as follows,
(7) $\mathcal{L}_{hn} = -\dfrac{1}{2B}\sum_{i=1}^{B}\left[\log\dfrac{\exp(\langle \bar{a}_i, \bar{t}_i\rangle/\tau)}{\sum_{j=1}^{B} w^{a2t}_{ij}\exp(\langle \bar{a}_i, \bar{t}_j\rangle/\tau)} + \log\dfrac{\exp(\langle \bar{a}_i, \bar{t}_i\rangle/\tau)}{\sum_{j=1}^{B} w^{t2a}_{ij}\exp(\langle \bar{a}_j, \bar{t}_i\rangle/\tau)}\right]$

where $w^{a2t}_{ij}$ and $w^{t2a}_{ij}$ are the audio-to-text and text-to-audio difficulty scores for unpaired samples; they are designed so that hard negative pairs (with higher similarity compared to the average) are emphasized, and easier pairs are neglected. Thus the model will be forced to learn a more discriminative feature space to distinguish confusable pairs for multi-grained alignment. The formula is written as,
(8) $w^{a2t}_{ij} = \dfrac{(B-1)\exp(\beta\langle \bar{a}_i, \bar{t}_j\rangle/\tau)}{\sum_{k\neq i}\exp(\beta\langle \bar{a}_i, \bar{t}_k\rangle/\tau)},\quad w^{t2a}_{ij} = \dfrac{(B-1)\exp(\beta\langle \bar{a}_j, \bar{t}_i\rangle/\tau)}{\sum_{k\neq i}\exp(\beta\langle \bar{a}_k, \bar{t}_i\rangle/\tau)}\ \ (j\neq i),\qquad w^{a2t}_{ii}=w^{t2a}_{ii}=1$

where $\beta$ is a scaling ratio; the larger it is, the more importance we attach to the hard negative samples, as the distributions of $w^{a2t}$ and $w^{t2a}$ become sharper.
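A minimal PyTorch sketch of this objective follows; the default $\beta$ is a placeholder, and the implementation details (e.g., whether gradients flow through the weights) are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn.functional as F

def hard_negative_clap_loss(audio_global: torch.Tensor,
                            text_global: torch.Tensor,
                            tau: float = 0.07,
                            beta: float = 0.5) -> torch.Tensor:
    """Contrastive loss with hard-negative re-weighting (illustrative sketch).

    Negatives more similar than the batch average get weights > 1; beta sharpens
    the weight distribution. The positive pair keeps a unit weight.
    """
    a = F.normalize(audio_global, dim=-1)
    t = F.normalize(text_global, dim=-1)
    sim = a @ t.T / tau                               # (B, B) scaled similarities
    B = sim.size(0)
    eye = torch.eye(B, device=sim.device, dtype=torch.bool)

    def directional_loss(s: torch.Tensor) -> torch.Tensor:
        neg = s.masked_fill(eye, float("-inf"))
        w = (B - 1) * F.softmax(beta * neg, dim=1)    # Eq. (8): mean negative weight is 1
        w = torch.where(eye, torch.ones_like(w), w)
        denom = (w * s.exp()).sum(dim=1)              # re-weighted denominator of Eq. (7)
        return -(s.diag() - denom.log()).mean()

    return 0.5 * (directional_loss(sim) + directional_loss(sim.T))
```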
4. Experimental Setup
4.1. Pre-training
Dataset. We merge WavCaps, the training set of AudioCaps and Clotho for pre-training, including about 450K audio-text pairs.
Architecture. We employ the pre-trained BERT (Kenton and Toutanova, 2019) base model as the text encoder, which contains 110M parameters. For the audio encoder, to examine the scalability of the proposed method, we adopt the patch-wise HTS-AT (27M) (Chen et al., 2022a) and the frame-wise AST (86M) (Li et al., 2024a); both are pre-trained on AudioSet by previous works, and we directly use their checkpoints. Besides, a two-layer MLP is appended after each encoder, projecting the multi-modal features into the same dimension $D$.
Implementation Details. We train our model for 10 epochs with a batch size of 128 and a learning rate of 5e-5 using the Adam optimizer. The temperature $\tau$ is learnable with an initial value of 0.07, while $\sigma$ and the codebook size $K$ are fixed to 0.15 and 4096 empirically. Besides, all audio clips and captions are randomly cropped or padded to 10 seconds and 30 words to guarantee a fixed length. We also resample the waveforms to 32 kHz and 16 kHz for HTS-AT and AST following the original works. During training, audio clips with similar durations are grouped within a batch for training efficiency. Finally, model checkpoints are selected based on their performance on validation sets after each epoch, and the final model is evaluated on the corresponding test sets. The code is released at https://github.com/Ming-er/MGA-CLAP.
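As a reference for the data pipeline, the snippet below sketches the crop/pad and resampling steps described above (shown for the 32 kHz HTS-AT setting); the function names are illustrative and not taken from the released code.

```python
import torch
import torch.nn.functional as F
import torchaudio

TARGET_SR = 32000      # 32 kHz for HTS-AT (16 kHz would be used for AST)
CLIP_SECONDS = 10      # clips are randomly cropped or padded to 10 seconds

def preprocess_waveform(wav: torch.Tensor, sr: int) -> torch.Tensor:
    """Resample a mono waveform and randomly crop or zero-pad it to a fixed length."""
    if sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    target_len = TARGET_SR * CLIP_SECONDS
    if wav.size(-1) > target_len:
        start = torch.randint(0, wav.size(-1) - target_len + 1, (1,)).item()
        wav = wav[..., start:start + target_len]
    else:
        wav = F.pad(wav, (0, target_len - wav.size(-1)))
    return wav

def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Adam with lr 5e-5, as reported; batch size 128 and 10 epochs are set in the training loop.
    return torch.optim.Adam(model.parameters(), lr=5e-5)
```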
4.2. Downstream Evaluation
To comprehensively evaluate the performance, we conduct experiments on several coarse-grained tasks (audio retrieval, audio classification, and audio tagging) and fine-grained tasks (sound event detection and text-to-audio grounding). Note that for each specific task, we pre-train a new model from scratch on a newly constructed dataset to remain consistent with the baselines; specifically, all samples overlapping with the downstream evaluation data are excluded, so that the inference is truly zero-shot. For tasks other than retrieval, we directly use sound class names as the textual input, which avoids heavy prompt engineering (a minimal inference sketch is given after Table 1). We report the averaged metric of 3 different runs, and the detailed evaluation protocols are provided in Table 1. For single-label and multi-label classification tasks, Acc and mAP are the widely adopted metrics. For retrieval tasks, R@k is 1 if the positive item appears in the top k retrieved items for a query (Koepke et al., 2022). For detection and grounding tasks, PSDS1 is more sensitive to the precise localization of sound events, followed by PSDSm and PSDS2, which pay more attention to removing confusion between classes (Ebbers et al., 2022).
Table 1. Downstream tasks, datasets, and evaluation metrics.

| Task | Datasets | Metrics |
|---|---|---|
| audio retrieval | AudioCaps (AC), Clotho | R@1, R@5 |
| audio classification | ESC-50 (Piczak, 2015), UrbanSound8K (US8K) (Salamon et al., 2014), VGGSound (Chen et al., 2020) | Acc |
| audio tagging | FSD50K (Fonseca et al., 2021), AudioSet (AS) | mAP |
| sound event detection | DESED (Serizel et al., 2020), UrbanSED (Salamon et al., 2017), AudioSet-Strong (AS-S) (Hershey et al., 2021) | PSDS1, PSDS2 |
| text-to-audio grounding | TAG (Xu et al., 2021) | PSDSm |
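To make the zero-shot protocol concrete, the sketch below scores class names against a clip with the aligned encoders: the clip-level scores serve tagging and classification, while the frame-level scores serve detection and grounding. `audio_encoder` and `text_encoder` are placeholders for the trained MGA-CLAP encoders, and the assumed return shapes are noted in the comments.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_scores(audio_encoder, text_encoder, waveform, class_names):
    """Zero-shot scoring via cosine similarity with class-name text embeddings.

    Assumed interfaces: audio_encoder(waveform) -> (global_feat [D], frame_feats [N, D]);
    text_encoder(name) -> one global text embedding [D] per class name.
    """
    global_feat, frame_feats = audio_encoder(waveform)
    class_embs = F.normalize(torch.stack([text_encoder(n) for n in class_names]), dim=-1)  # (C, D)

    # Clip-level scores: zero-shot tagging / classification
    clip_scores = F.normalize(global_feat, dim=-1) @ class_embs.T        # (C,)

    # Frame-level scores: zero-shot sound event detection / grounding
    frame_scores = F.normalize(frame_feats, dim=-1) @ class_embs.T       # (N, C)
    return clip_scores, frame_scores
```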
5. Results
5.1. Model Performance
5.1.1. Performance on Coarse-grained Tasks
Table 2. Zero-shot audio-text retrieval results (%) on AudioCaps (AC) and Clotho (T2A: text-to-audio, A2T: audio-to-text).

| Model | AC T2A R@1 | AC T2A R@5 | AC A2T R@1 | AC A2T R@5 | Clotho T2A R@1 | Clotho T2A R@5 | Clotho A2T R@1 | Clotho A2T R@5 |
|---|---|---|---|---|---|---|---|---|
| FLAP (fusion) | 41.5 | 75.5 | 53.0 | 84.1 | 20.3 | 46.5 | 25.5 | 53.4 |
| Cacophony | 41.0 | 75.3 | 55.3 | 83.6 | 20.2 | 45.9 | 26.5 | 54.1 |
| CLAP (HTS-AT) | 39.7 | 74.5 | 51.9 | 82.1 | 19.5 | 45.2 | 23.4 | 50.7 |
| MGA-CLAP (HTS-AT) | 41.8 | 76.1 | 54.4 | 83.6 | 20.4 | 46.0 | 25.3 | 51.2 |
| CLAP (AST) | 40.1 | 74.0 | 51.8 | 82.4 | 18.5 | 43.3 | 23.9 | 51.6 |
| MGA-CLAP (AST) | 42.2 | 74.9 | 53.7 | 84.3 | 20.8 | 45.0 | 26.5 | 54.1 |
Table 3. Zero-shot classification (Acc, %) and tagging (mAP, %) results.

| Model | ESC-50 | US8K | VGGSound | FSD50K | AS |
|---|---|---|---|---|---|
| Cacophony | 93.4 | 77.1 | 27.0 | - | - |
| CompA | 89.1 | 85.7 | 29.5 | - | - |
| CLAP (HTS-AT) | 94.7 | 80.7 | 28.6 | 52.4 | 21.1 |
| MGA-CLAP (HTS-AT) | 94.9 | 83.7 | 31.8 | 54.5 | 23.0 |
| CLAP (AST) | 91.6 | 76.6 | 26.8 | 47.8 | 16.9 |
| MGA-CLAP (AST) | 92.0 | 79.4 | 29.2 | 49.7 | 19.3 |
We compare our proposed MGA-CLAP not only with the original CLAP but also with the SOTA model in each separate task. Specifically, for zero-shot retrieval, we involve FLAP (fusion) (Yeh et al., 2023) and Cacophony (Zhu and Duan, 2024) for comparison. The former is trained on LAION-Audio-630K using a more powerful audio encoder, MAViL (Huang et al., 2024), and employs the feature fusion proposed in (Wu et al., 2023a) to process audios longer than 10s instead of directly cropping them. The latter is trained on a 4M audio-text dataset with LLM re-captioning, which is much larger than our 450K pairs. For zero-shot classification, we additionally involve CompA (Ghosh et al., 2023), which leverages an instruction-tuned Flan-T5-large model (770M) (Raffel et al., 2020) as the text encoder and CompA-661K (an extension of LAION-Audio-630K) as the pre-training set.
The results for coarse-grained retrieval and tagging tasks are reported in Tables 2 and 3, respectively. As for retrieval, the proposed MGA-CLAP largely surpasses the original CLAP no matter which backbone is applied. When compared with current SOTA methods, which involve more training resources, MGA-CLAP is also competitive, achieving the best performance on 6 of 8 metrics. For classification and tagging tasks, similar improvements over the original CLAP can be observed. Notably, our MGA-CLAP with HTS-AT encoder reaches 31.8% accuracy on VGGSound, the most complex single-label classification dataset with 300+ classes, which is 2.3% higher than the previous SOTA, CompA. These results underscore MGA-CLAP's ability to capture cross-modal alignment between texts and audio, leading to outstanding performance on versatile classification and retrieval tasks.
5.1.2. Performance on Fine-grained Tasks
Table 4. Zero-shot sound event detection and text-to-audio grounding results (PSDS, %).

| Model | DESED PSDS1 | DESED PSDS2 | UrbanSED PSDS1 | UrbanSED PSDS2 | AS-S PSDS1 | TAG PSDSm |
|---|---|---|---|---|---|---|
| UACA | 14.2 | 53.7 | 2.3 | 11.8 | 3.4 | 37.5 |
| WSTAG | 17.1 | 54.3 | 3.9 | 12.6 | 4.0 | 41.7 |
| PACL | 17.9 | 55.6 | 4.3 | 14.0 | 4.9 | 42.5 |
| CLAP (HTS-AT) | 13.1 | 52.0 | 1.6 | 10.6 | 3.4 | 34.4 |
| MGA-CLAP (HTS-AT) | 26.4 | 58.9 | 8.7 | 19.3 | 10.1 | 48.7 |
| CLAP (AST) | 13.5 | 48.9 | 1.7 | 10.8 | 4.5 | 36.9 |
| MGA-CLAP (AST) | 25.2 | 55.5 | 7.6 | 14.9 | 10.6 | 54.8 |
We reimplement and retrain UACA (Xie et al., 2022) and WSTAG (Xu et al., 2023a) using the CLAP paradigm since the original works only experiment on tiny datasets. Besides, we also reproduce PACL (Mukhoti et al., 2023), a recent vision-language training framework, which employs cross-modal attention pooling to align local features with captions and demonstrates superior performance on fine-grained visual understanding tasks. All the above methods are implemented based on the HTS-AT backbone and we keep the training and evaluation settings consistent with MGA-CLAP.
As mentioned before, the original CLAP cannot uncover fine-grained alignment between frame features and text descriptions. As depicted in Table 4, it obtains extremely low scores, especially on time-sensitive metrics such as PSDS1 and PSDSm. Although UACA and WSTAG attempt to solve this problem, their results are still unpromising, since the frame-to-word interaction is modeled via simple score pooling. Besides, directly transferring PACL leads to a better but still suboptimal outcome. In comparison, our MGA-CLAP with HTS-AT backbone obtains PSDS1 scores of 26.4%/8.7%/10.1% on the DESED/UrbanSED/AudioSet-Strong eval sets, which are roughly 2x/5x/3x those of the original CLAP. When switching the audio encoder to AST, better performance is witnessed on datasets containing more queries, such as AS-S and TAG.
5.2. Ablation Study
In this section, we ablate the designs of the proposed MGA-CLAP. All experiments are conducted based on the HTS-AT backbone.
5.2.1. Ablation Study on Each Sub-module
Table 5. Ablation on each sub-module (MC: modality-shared codebook, LB: locality-aware block, HN: hard-negative guided loss). AC T2A/A2T report R@1.

| MC | LB | HN | AC T2A | AC A2T | VGGSound | FSD50K | DESED | AS-S | TAG |
|---|---|---|---|---|---|---|---|---|---|
|  |  |  | 39.7 | 51.9 | 28.6 | 52.4 | 13.1 | 3.4 | 34.4 |
| ✓ |  |  | 41.0 | 53.6 | 30.7 | 53.5 | 20.1 | 7.3 | 41.7 |
|  | ✓ |  | 39.4 | 51.8 | 28.5 | 52.9 | 21.2 | 5.6 | 41.1 |
| ✓ | ✓ |  | 41.2 | 53.7 | 30.9 | 53.8 | 26.5 | 9.5 | 47.6 |
| ✓ | ✓ | ✓ | 41.8 | 54.4 | 31.8 | 54.5 | 26.4 | 10.1 | 48.7 |
Table 5 shows the model performance of the proposed MGA-CLAP trained with or without a specific sub-module. As seen, the incorporation of a modality-shared codebook boosts CLAP’s understanding capabilities on both coarse-grained and fine-grained tasks, as it not only adopts common bases to represent global audio and text features but also links multi-modal local features with shared codewords. However, solely training with it cannot lead to satisfactory results on fine-grained tasks due to the inferiority of frame-wise representations. When further adopting the locality-aware encoder block, the PSDS scores on DESED, AS-S, and TAG datasets are improved by 5.1%, 2.2%, and 5.9%, respectively, suggesting the necessity of acquiring high-quality frame features. Additionally, simply involving the locality-aware block can also enhance the model performance on detection and grounding tasks. Finally, further equipped with the hard-negative loss, the whole system can achieve optimal results on each task as it can enhance the contrastive learning scheme.
5.2.2. Ablation Study on the Size of Codebook


We compare the shared codebook at different sizes in Figure 5. As seen from the left figure, the R@1 scores on AudioCaps drop largely when the number of codewords increases from 4096 to 8192. We argue that the enlarged codebook may bring about noisily activated codewords and irrelevant information during aggregation, making it difficult to retrieve matched pairs. Besides, it is also at risk of underfitting since some codewords may be undertrained. By contrast, although a smaller number of codewords leads to slightly better outcomes on retrieval tasks, the performance on frame-level tasks decreases a lot, as shown in the right figure. The possible reason is that each codeword must convey multiple semantics within a smaller codebook, thereby disturbing the frame-word interaction while seeking fine-grained alignment. Finally, we fix the number of codewords to 4096 to make a trade-off between multi-grained tasks.
5.2.3. Ablation Study on the Number of Locality-aware Block
We conduct a parameter analysis on the number of vanilla transformer blocks replaced by locality-aware ones in Figure 5. It can be observed that incorporating locality-aware blocks contributes a lot to the enhanced capability due to the refinement of frame-wise features. Additionally, adopting 1 or 2 locality-aware blocks has similar effects on the downstream tasks. However, as the number grows, performance degradation is witnessed, possibly because more locality-aware blocks destroy the information flow and pre-trained knowledge in the transformer backbone.
5.2.4. Ablation Study on the Values of $\beta$

As stated before, $\beta$ in Equation (8) controls the difficulty of negative samples, with a higher value paying more attention to harder ones. We study the effect of its numerical value in Figure 6. The results indicate that moderate values generally yield better outcomes, as a larger $\beta$ may overemphasize the hard negative samples and potentially neglect the relation with other in-batch data points.
5.2.5. Ablation Study on the Design Choice of Codebook
We ablate the detailed designs of the codebook, namely the max pooling used to compute the affinity scores and the Sparsemax used to normalize the aggregation weights, and provide the outcomes in Table 6. As shown, if mean pooling is applied instead of max pooling, some non-salient local cues may be overwhelmed by the primary sound; severe performance drops are then found in tagging, detection and grounding tasks, where local patterns play an important role. When the normalization function is changed to Softmax, the system produces poor results on all tasks, only slightly better than the original CLAP. We argue that with Softmax, the aggregation weights are no longer sparse, introducing noisy components when representing the global features. Consequently, the semantics of codewords may be blurred and the connection between frame and word features weakened.
Table 6. Ablation on the design choices of the codebook. AC T2A/A2T report R@1.

| Design | AC T2A | AC A2T | VGGSound | FSD50K | DESED | AS-S | TAG |
|---|---|---|---|---|---|---|---|
| - (default) | 41.8 | 54.4 | 31.8 | 54.5 | 26.4 | 10.1 | 48.7 |
| (1) mean pooling for affinity | 40.9 | 52.9 | 30.2 | 52.7 | 15.3 | 5.6 | 39.0 |
| (2) Softmax for normalization | 40.2 | 52.3 | 28.5 | 52.6 | 13.4 | 4.1 | 35.8 |



5.3. Visualizations
5.3.1. Semantics of Codewords
In this subsection, we try to reveal the meanings of some representative codewords. For a specific codeword, we first compute its similarity with the textual features of sound classes taken from the AudioSet taxonomy to find out its semantics. Then we compute its similarity with frame representations to examine whether it also correlates with acoustic features. The results are given in Figure 7. Taking the 1st row as an example, the 2917th codeword has a large similarity with textual descriptions related to "dog", suggesting its semantics. Besides, as shown in the two sub-figures, the similarity between the codeword and frame features is also synchronized with the temporal locations of sound events. Specifically, the scores are high when a dog bark is actually present and low when it is absent. Comparing the 2nd and 3rd rows, it can be seen that codewords can encode finer acoustic attributes, such as the gender of speakers. For the 4th row, the semantics of rare sound classes (e.g., sewing machine) can also be learned by the MGA-CLAP pipeline. And from the last figure, the semantic mapping remains salient even in polyphonic environments.
5.3.2. Fine-grained Alignment
We provide several examples of MGA-CLAP achieving fine-grained alignment in Figure 8. The cases show that our method can capture both frame-to-phrase (first row) and frame-to-caption (second row) correspondence, obtaining promising results on zero-shot detection and grounding tasks. Surprisingly, it can tell apart barking and whimpering at the frame level, as seen in sub-figure (e); both are made by dogs but differ in pitch, suggesting that the subtle semantics of captions are also aligned with acoustic characteristics. Moreover, we visualize some bad cases in Figure 9. Currently, MGA-CLAP may confuse similar sounds such as blender and vacuum cleaner (Figure 9 (a)) and sometimes fails to capture long-duration dependencies (Figure 9 (b)). Moreover, it may omit certain sounds (such as the blender in Figure 9 (c)), especially when multiple acoustic events take place simultaneously.
6. Conclusion
We devise MGA-CLAP to align audio features with language descriptions from both coarse- and fine-grained views. To achieve this, MGA-CLAP employs a codebook to construct a shared feature space for cross-modal interaction and optimizes the codewords to seek frame-word correspondence. Based on the shared codebook, a novel block is designed to enhance the salience of local patterns, while a re-weighting loss is devised to mine hard-negative pairs for better cross-modal alignment. By pre-training on large datasets, our MGA-CLAP not only outperforms the baseline CLAP but also yields better or competitive outcomes on versatile language-audio understanding tasks compared with SOTA variants.
Acknowledgements.
Our work is supported by the National Natural Science Foundation of China (62276250) and the Major Project of the National Social Science Foundation of China (21&ZD292).

References
- Bao et al. (2021) Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. 2021. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021).
- Chen et al. (2020) Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. 2020. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 721–725.
- Chen et al. (2022a) Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. 2022a. HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 646–650.
- Chen et al. (2022b) Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, and Furu Wei. 2022b. Beats: Audio pre-training with acoustic tokenizers. arXiv preprint arXiv:2212.09058 (2022).
- Défossez et al. (2022) Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. 2022. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438 (2022).
- Deshmukh et al. (2022) Soham Deshmukh, Benjamin Elizalde, and Huaming Wang. 2022. Audio retrieval with wavtext5k and clap training. arXiv preprint arXiv:2209.14275 (2022).
- Drossos et al. (2020) Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. 2020. Clotho: An audio captioning dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 736–740.
- Ebbers et al. (2022) Janek Ebbers, Reinhold Haeb-Umbach, and Romain Serizel. 2022. Threshold independent evaluation of sound event detection scores. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1021–1025.
- Elizalde et al. (2023) Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. 2023. Clap learning audio concepts from natural language supervision. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
- Fonseca et al. (2021) Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. 2021. Fsd50k: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021), 829–852.
- Gemmeke et al. (2017) Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 776–780.
- Ghosh et al. (2023) Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S Ramaneswaran, S Sakshi, Oriol Nieto, Ramani Duraiswami, and Dinesh Manocha. 2023. CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models. arXiv preprint arXiv:2310.08753 (2023).
- Gong et al. (2022) Yuan Gong, Cheng-I Lai, Yu-An Chung, and James Glass. 2022. Ssast: Self-supervised audio spectrogram transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 10699–10709.
- Guzhov et al. (2022) Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel. 2022. Audioclip: Extending clip to image, text and audio. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 976–980.
- Hershey et al. (2021) Shawn Hershey, Daniel PW Ellis, Eduardo Fonseca, Aren Jansen, Caroline Liu, R Channing Moore, and Manoj Plakal. 2021. The benefit of temporally-strong labels in audio event classification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 366–370.
- Huang et al. (2024) Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer, et al. 2024. Mavil: Masked audio-video learners. Advances in Neural Information Processing Systems 36 (2024).
- Huang et al. (2022) Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. 2022. Masked autoencoders that listen. Advances in Neural Information Processing Systems 35 (2022), 28708–28720.
- Kenton and Toutanova (2019) Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171–4186.
- Kim et al. (2019) Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. 2019. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 119–132.
- Koepke et al. (2022) A Sophia Koepke, Andreea-Maria Oncescu, Joao Henriques, Zeynep Akata, and Samuel Albanie. 2022. Audio retrieval with natural language queries: A benchmark study. IEEE Transactions on Multimedia (2022).
- Kong et al. (2020) Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley. 2020. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), 2880–2894.
- Kumar and Raj (2016) Anurag Kumar and Bhiksha Raj. 2016. Audio event detection using weakly labeled data. In Proceedings of the 24th ACM international conference on Multimedia. 1038–1047.
- Li et al. (2021) Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34 (2021), 9694–9705.
- Li et al. (2024a) Xian Li, Nian Shao, and Xiaofei Li. 2024a. Self-supervised audio teacher-student transformer for both clip-level and frame-level tasks. IEEE/ACM Transactions on Audio, Speech, and Language Processing (2024).
- Li et al. (2024b) Yiming Li, Xiangdong Wang, and Hong Liu. 2024b. Audio-Free Prompt Tuning for Language-Audio Models. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 491–495.
- Li et al. (2024c) Yiming Li, Xiangdong Wang, Hong Liu, Rui Tao, Long Yan, and Kazushige Ouchi. 2024c. Semi-Supervised Sound Event Detection with Local and Global Consistency Regularization. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 271–275.
- Lin et al. (2020) Liwei Lin, Xiangdong Wang, Hong Liu, and Yueliang Qian. 2020. Specialized decision surface and disentangled feature for weakly-supervised polyphonic sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), 1466–1478.
- Liu et al. (2023) Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. 2023. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. Proceedings of the International Conference on Machine Learning (2023).
- Martins and Astudillo (2016) Andre Martins and Ramon Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International conference on machine learning. PMLR, 1614–1623.
- Mei et al. (2023) Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang. 2023. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. arXiv preprint arXiv:2303.17395 (2023).
- Mukhoti et al. (2023) Jishnu Mukhoti, Tsung-Yu Lin, Omid Poursaeed, Rui Wang, Ashish Shah, Philip HS Torr, and Ser-Nam Lim. 2023. Open vocabulary semantic segmentation with patch aligned contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19413–19423.
- Piczak (2015) Karol J Piczak. 2015. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia. 1015–1018.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21, 140 (2020), 1–67.
- Raghu et al. (2021) Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. 2021. Do vision transformers see like convolutional neural networks? Advances in neural information processing systems 34 (2021), 12116–12128.
- Razavi et al. (2019) Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. 2019. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems 32 (2019).
- Robinson et al. (2020) Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. 2020. Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592 (2020).
- Salamon and Bello (2017) Justin Salamon and Juan Pablo Bello. 2017. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal processing letters 24, 3 (2017), 279–283.
- Salamon et al. (2014) Justin Salamon, Christopher Jacoby, and Juan Pablo Bello. 2014. A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM international conference on Multimedia. 1041–1044.
- Salamon et al. (2017) Justin Salamon, Duncan MacConnell, Mark Cartwright, Peter Li, and Juan Pablo Bello. 2017. Scaper: A library for soundscape synthesis and augmentation. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 344–348.
- Serizel et al. (2020) Romain Serizel, Nicolas Turpault, Ankit Shah, and Justin Salamon. 2020. Sound event detection in synthetic domestic environments. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 86–90.
- Van Den Oord et al. (2017) Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in neural information processing systems 30 (2017).
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
- Wang et al. (2023a) Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. 2023a. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111 (2023).
- Wang et al. (2023b) Weihan Wang, Zhen Yang, Bin Xu, Juanzi Li, and Yankui Sun. 2023b. Vilta: Enhancing vision-language pre-training through textual augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3158–3169.
- Wu et al. (2023c) Ho-Hsiang Wu, Oriol Nieto, Juan Pablo Bello, and Justin Salomon. 2023c. Audio-Text Models Do Not Yet Leverage Natural Language. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
- Wu et al. (2022) Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello. 2022. Wav2clip: Learning robust audio representations from clip. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4563–4567.
- Wu et al. (2023a) Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. 2023a. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
- Wu et al. (2023b) Yi-Chiao Wu, Israel D Gebru, Dejan Marković, and Alexander Richard. 2023b. Audiodec: An Open-Source Streaming High-Fidelity Neural Audio Codec. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
- Xie et al. (2022) Huang Xie, Okko Räsänen, Konstantinos Drossos, and Tuomas Virtanen. 2022. Unsupervised audio-caption aligning learns correspondences between individual sound events and textual phrases. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8867–8871.
- Xie et al. (2023) Huang Xie, Okko Räsänen, and Tuomas Virtanen. 2023. On Negative Sampling for Contrastive Audio-Text Retrieval. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
- Xin et al. (2023) Yifei Xin, Dongchao Yang, and Yuexian Zou. 2023. Improving Text-Audio Retrieval by Text-Aware Attention Pooling and Prior Matrix Revised Loss. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
- Xu et al. (2021) Xuenan Xu, Heinrich Dinkel, Mengyue Wu, and Kai Yu. 2021. Text-to-audio grounding: Building correspondence between captions and sound events. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 606–610.
- Xu et al. (2023a) Xuenan Xu, Mengyue Wu, and Kai Yu. 2023a. Investigating Pooling Strategies and Loss Functions for Weakly-Supervised Text-to-Audio Grounding via Contrastive Learning. In 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW). IEEE, 1–5.
- Xu et al. (2023b) Xuenan Xu, Zhiling Zhang, Zelin Zhou, Pingyue Zhang, Zeyu Xie, Mengyue Wu, and Kenny Q. Zhu. 2023b. BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data. In Proceedings of the 31st ACM International Conference on Multimedia. 2756–2764.
- Yan et al. (2022) Shipeng Yan, Lanqing Hong, Hang Xu, Jianhua Han, Tinne Tuytelaars, Zhenguo Li, and Xuming He. 2022. Generative negative text replay for continual vision-language pretraining. In European Conference on Computer Vision. Springer, 22–38.
- Yeh et al. (2023) Ching-Feng Yeh, Po-Yao Huang, Vasu Sharma, Shang-Wen Li, and Gargi Gosh. 2023. Flap: Fast language-audio pre-training. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 1–8.
- Zeghidour et al. (2021) Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. 2021. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021), 495–507.
- Zhu and Duan (2024) Ge Zhu and Zhiyao Duan. 2024. Cacophony: An Improved Contrastive Audio-Text Model. arXiv preprint arXiv:2402.06986 (2024).