MACE: Leveraging Audio for Evaluating Audio Captioning Systems
Abstract
The Automated Audio Captioning (AAC) task aims to describe an audio signal using natural language. To evaluate machine-generated captions, the metrics should take into account audio events, acoustic scenes, paralinguistics, signal characteristics, and other audio information. Traditional AAC evaluation relies on natural language generation metrics like ROUGE and BLEU, image captioning metrics such as SPICE and CIDEr, or Sentence-BERT embedding similarity. However, these metrics only compare generated captions to human references, overlooking the audio signal itself. In this work, we propose MACE (Multimodal Audio-Caption Evaluation), a novel metric that integrates both audio and reference captions for comprehensive audio caption evaluation. MACE incorporates information from the audio itself as well as from the predicted and reference captions, and weights the resulting similarity with a fluency penalty. Our experiments demonstrate MACE’s superior performance in predicting human quality judgments compared to traditional metrics. Specifically, MACE achieves a 3.28% and 4.36% relative accuracy improvement over the FENSE metric on the AudioCaps-Eval and Clotho-Eval datasets, respectively. Moreover, it significantly outperforms all previous metrics on the audio captioning evaluation task. The metric is open-sourced at https://github.com/satvik-dixit/mace.
Index Terms:
Automated Audio Captioning, Evaluation Metric, Audio-Language Models
I Introduction
The Automated Audio Captioning (AAC) task [1] is centered on producing natural language descriptions for audio content. This process involves identifying audio events [2], acoustic scenes [3], temporal relationships [4], etc., within the audio stream. Once trained, an AAC system has numerous applications [1], such as assisting individuals with hearing impairments, enhancing security and surveillance systems, supporting multimedia retrieval, and more.
Building a robust AAC system requires evaluating outputs on three main dimensions: accuracy (covering all audio events, scenes, and actions), linguistic quality (grammar, coherence), and readability (clarity and logical flow). Traditional AAC metrics, such as BLEU [5], ROUGE [6], and METEOR [7], emphasize linguistic variation [8] through n-gram overlap between candidate and reference sentences. Metrics like SPICE [9] incorporate relational information by parsing captions into a graph of semantic elements, their attributes, and their relations to one another, and evaluate the candidate graph via synonym and lemma matching. Recently, FENSE [10], developed specifically for audio captioning, leverages Sentence-BERT [11] embeddings to capture semantic similarity between generated and reference captions. Subsequent methods [12, 13] have built upon this approach, predominantly employing Sentence-BERT embeddings. However, a key limitation of existing metrics is that they exclude audio information and require reference captions to perform evaluation.
We hypothesize that incorporating audio information into AAC metrics will enhance semantic accuracy and better align with human judgment. Audio information can be integrated either through direct audio-caption comparison or by grounding text embeddings in audio. For example, a generated caption like “The crowd is applauding in a stadium” versus the reference “The crowd is silent in a stadium” would be scored similarly by current metrics like FENSE, despite opposite meanings. Grounding embeddings in audio, however, could yield lower similarity scores for such differences. CLAP [14, 15, 16, 17] offers one approach, aligning audio and text in a multimodal space to capture shared audio events and scenes. Yet, solely using CLAP may miss essential linguistic and readability factors [18], underscoring the need for metrics that integrate both linguistic quality and audio context for comprehensive evaluation.
In this paper, we propose MACE, a novel and comprehensive metric for evaluating audio captioning systems. MACE addresses a fundamental limitation in existing evaluation approaches by incorporating both audio and linguistic information. The metric comprises three components: first, it leverages CLAP audio and text embeddings to assess the relevance of the generated caption with respect to the audio content; second, it employs CLAP text embeddings to measure acoustic similarity between the generated and reference captions; third, it applies a fluency error penalty to weight the similarity scores, ensuring grammatical accuracy in the generated captions. We evaluate MACE on two commonly used benchmarks for AAC metrics: AudioCaps-Eval [10] and Clotho-Eval [10]. MACE produces state-of-the-art results and outperforms all prior metrics. Specifically, MACE achieves a 3.28% and 4.36% relative accuracy improvement over the FENSE metric on the AudioCaps-Eval and Clotho-Eval datasets, respectively. Moreover, because MACE comprises three components, caption quality can be evaluated without a reference caption by using only two of the three components.
II Related Work
Linguistic metrics. Audio captioning evaluation has drawn from fields like NLP and image captioning. Traditional metrics, such as BLEU [5] and ROUGE [6], rely on N-gram matching between candidate and reference captions, but struggle in audio captioning, where diverse descriptions can accurately represent the same sound. Advanced metrics like METEOR [7] introduced synonym matching and stemming for better semantic alignment, while CIDEr [19] used TF-IDF weighting to emphasize rare, meaningful N-grams. SPICE, developed for image captioning, compares “object-graphs” between captions to model conceptual relationships, and SPIDEr [20], combining SPICE and CIDEr, aimed to improve robustness and alignment with human judgment.
Embedding-based metrics. BERTScore [21], BLEURT [22], and Sentence-BERT encode candidate and reference sentences as vectors using pre-trained language models, computing distances between these vectors to assess semantic similarity. This approach has shown promise in capturing more subtle semantic relationships that N-gram based methods often miss. In the specific context of audio captioning, FENSE [10] emerged as a significant advancement. Building upon the embedding-based approach, FENSE incorporates an additional fluency detection mechanism to address the issue of semantically similar but non-fluent captions. Recent efforts have tried to combine parsing-based approaches with embedding methods in two-stage frameworks. SPICE+ [12] and ACES [23] both employ initial parsing stages, either generating parse graphs or extracting explicit sound descriptors, followed by Sentence-BERT embedding comparisons.
LLM-based metrics. The rapid advancement of large language models (LLMs) like GPT-4 [24] has opened new avenues for evaluation metrics. X-ACE [13] replaces fixed components in SPICE with LLM-based parsers. CLAIRA [25] represents another innovative approach, directly leveraging an LLM and in-context learning [26] to produce a numeric score between 0 and 100 quantifying the similarity between candidate and reference captions. This method bypasses intermediate representations, relying instead on the LLM’s inherent understanding of language and context. Despite these advancements, a limitation persists across all these metrics: they operate solely in the textual domain, comparing generated captions to reference captions without considering the audio signal.
Metric | Clotho-Eval: HC | HI | HM | MM | All | AudioCaps-Eval: HC | HI | HM | MM | All
BLEU@1 | 51.0 | 90.6 | 65.5 | 50.3 | 59.0 | 58.6 | 90.3 | 77.4 | 50.3 | 62.4 |
BLEU@4 | 52.9 | 88.9 | 65.1 | 53.2 | 60.5 | 54.7 | 85.8 | 78.7 | 50.6 | 61.6 |
METEOR | 54.8 | 93.0 | 74.6 | 57.8 | 65.4 | 66.0 | 96.4 | 90.0 | 60.1 | 71.7 |
ROUGE-L | 56.2 | 90.6 | 69.4 | 50.7 | 60.5 | 61.1 | 91.5 | 82.8 | 52.1 | 64.9 |
CIDEr | 51.4 | 91.8 | 70.3 | 56.0 | 63.2 | 56.2 | 96.0 | 90.4 | 61.2 | 71.0 |
SPICE | 44.3 | 84.4 | 65.5 | 48.9 | 56.3 | 50.2 | 83.8 | 77.8 | 49.1 | 59.7 |
SPICE+ | 46.7 | 88.1 | 70.3 | 48.7 | 57.8 | 59.1 | 85.4 | 83.7 | 49.0 | 62.0 |
ACES | 56.7 | 95.5 | 82.8 | 69.9 | 74.0 | 64.5 | 95.1 | 89.5 | 82.0 | 83.0 |
SPIDEr | 53.3 | 93.4 | 70.3 | 57.0 | 64.2 | 56.7 | 93.4 | 70.3 | 57.0 | 64.2 |
FENSE | 60.5 | 94.7 | 80.2 | 72.8 | 75.7 | 64.5 | 98.4 | 91.6 | 84.6 | 85.3 |
CLAIRA (+ Gemini-v1.5) | 59.0 | 95.9 | 83.2 | 75.1 | 77.4 | 70.4 | 99.2 | 93.7 | 81.5 | 84.9 |
CLAIRA (+ GPT-4o) | 62.4 | 97.1 | 83.6 | 77.9 | 79.7 | 70.9 | 99.2 | 93.3 | 84.6 | 86.6 |
MACE | 63.3 | 98.0 | 80.6 | 77.0 | 79.0 | 74.4 | 99.2 | 94.6 | 86.3 | 88.1 |
III Multimodal Audio-Caption Evaluation
The MACE metric utilizes the Contrastive Language-Audio Pre-training (CLAP) model to overcome limitations of text-only evaluation methods by generating embeddings for both audio and text, enabling evaluation that accounts for acoustic content alongside textual descriptions. MACE is computed through three components: (1) CLAP audio and text embeddings assess the relevance of the generated caption to the audio content, (2) text embeddings evaluate acoustic similarity between the generated and reference captions, and (3) a fluency error penalty ensures grammatical accuracy. Additionally, MACE can assess caption quality without a reference by using two of the three components.
[Fig. 1: Computation of the MACE metric from its three components]
Let a denote the source audio, c the candidate caption, and R the set of reference captions. The MACE metric is then computed using three components (Fig. 1):
• Audio-text. This score measures the similarity between the candidate caption and the audio content. The score is obtained by taking the cosine similarity between the CLAP audio embedding E_a(a) of the source audio and the CLAP text embedding E_t(c) of the caption:

S_audio-text(a, c) = cos(E_a(a), E_t(c))    (1)
• Text-text. This score measures the similarity between the candidate caption and the reference caption(s). The score is obtained by taking the dot product between the CLAP text embedding E_t(c) of the candidate caption and the CLAP text embeddings E_t(r) of the references, averaged over the reference set:

S_text-text(c, R) = (1/|R|) Σ_{r ∈ R} E_t(c) · E_t(r)    (2)
• Fluency error. This score, first introduced in FENSE [10], uses a BERT model trained to detect fluency errors such as incomplete sentences, repeated elements, and missing conjunctions or verbs, penalizing captions that lack coherence or grammar. If the detected error probability exceeds a set threshold, the embedding similarity score is reduced by multiplying it by (1 − α), where α is a weighting factor.
The MACE metric is a combination of the above three components. It is computed as:

MACE = (1 − α · FP) · (S_audio-text(a, c) + S_text-text(c, R)) / 2    (3)

where α is the weighting factor, FP is the fluency penalty indicator, which is 1 if the detected fluency error probability exceeds the threshold and 0 otherwise, and the audio-grounded score is the average of the audio-text and text-text scores.
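To make the combination concrete, the following is a minimal sketch of Eqs. (1)–(3) in Python. It is not the released implementation: the inputs are assumed to be pre-computed, L2-normalized CLAP embeddings (so cosine similarity reduces to a dot product), and the fluency error probability is assumed to come from an external detector such as the one used in FENSE.

```python
import numpy as np

def mace_score(audio_emb, cand_emb, ref_embs, fluency_error_prob,
               threshold=0.97, alpha=0.3):
    """Sketch of the MACE combination in Eqs. (1)-(3).

    audio_emb, cand_emb: normalized CLAP embeddings of the audio and the
    candidate caption; ref_embs: list of normalized reference-caption
    embeddings; fluency_error_prob: output of a fluency error detector.
    """
    # Eq. (1): audio-text cosine similarity (embeddings assumed normalized)
    s_audio_text = float(np.dot(audio_emb, cand_emb))

    # Eq. (2): text-text similarity, averaged over the reference captions
    s_text_text = float(np.mean([np.dot(cand_emb, r) for r in ref_embs]))

    # Fluency penalty indicator: 1 if an error is detected, else 0
    fp = 1.0 if fluency_error_prob > threshold else 0.0

    # Eq. (3): average the two similarities and apply the penalty
    return (1.0 - alpha * fp) * 0.5 * (s_audio_text + s_text_text)
```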
MACE captures semantic similarity, acoustic relevance, and linguistic quality within a single metric. By utilizing CLAP’s multimodal capabilities, it overcomes the core limitation of existing metrics that overlook the audio signal, offering a more comprehensive and accurate evaluation of the caption quality.
IV Experimental setup
We conduct experiments on two widely used datasets: Clotho-Eval and AudioCaps-Eval [10]. These datasets provide pairwise human annotations for caption evaluation, comprising 1,671 and 1,750 pairs of audio captions, respectively. To test MACE, we sourced the corresponding audio files for these caption pairs from the original AudioCaps [27] and Clotho [28] datasets. The audio files in AudioCaps-Eval have a duration of 10 seconds, while those in Clotho-Eval range from 15 to 30 seconds. For the CLAP model, we use the MSCLAP 2023 [16] model, which uses HTSAT [29] as the audio encoder and GPT-2 [30] as the text encoder. The audio is resampled to 44.1 kHz for CLAP. Natively, the CLAP model only supports 7-second audio inputs. To support longer audio, we split the audio into 7-second clips and take the mean of the corresponding audio embeddings, weighted by clip duration.
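As an illustration of this long-audio handling, the sketch below splits a waveform into 7-second clips and averages their embeddings weighted by clip duration. The embed_clip callable is a placeholder for a CLAP audio-encoder call, not part of any specific library API.

```python
import numpy as np

def long_audio_embedding(waveform, embed_clip, sr=44100, clip_seconds=7):
    """Duration-weighted average of CLAP embeddings over 7-second clips."""
    clip_len = clip_seconds * sr
    embeddings, weights = [], []
    for start in range(0, len(waveform), clip_len):
        clip = waveform[start:start + clip_len]     # last clip may be shorter
        embeddings.append(embed_clip(clip))         # placeholder CLAP audio encoder call
        weights.append(len(clip) / sr)              # weight = clip duration in seconds
    emb = np.average(np.stack(embeddings), axis=0, weights=weights)
    return emb / np.linalg.norm(emb)                # re-normalize for cosine similarity
```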
V Results
V-A Evaluating MACE
To evaluate the effectiveness of MACE, we conducted comprehensive benchmarking experiments on two widely used datasets: Clotho-Eval and AudioCaps-Eval. The task consists of evaluating a pair of captions and selecting the better of the two, where one caption in each pair is preferred based on scores assigned by human evaluators. The primary metric for comparison is pair accuracy, which measures a metric’s ability to correctly identify the human-preferred caption in each pair. To better understand and analyze the results, the test pairs are categorized into four groups: HC (two correct human captions), HI (one correct and one incorrect human caption), HM (one human and one machine-generated caption), and MM (two machine-generated captions). We compare MACE against a diverse set of metrics: BLEU@1, BLEU@4, METEOR, ROUGE-L, CIDEr, SPICE, SPICE+, ACES, SPIDEr, X-ACE, FENSE, and CLAIRA (with Gemini-v1.5 and GPT-4o). On Clotho-Eval, MACE outperforms all metrics on two categories (HC and HI) and is only slightly behind CLAIRA on the other two categories (HM and MM) and on overall accuracy (79.0% compared to 79.7%). The performance of CLAIRA depends on the ability of GPT-4o to identify which of the two captions in a given pair is better, information which it may have already seen during training. If Gemini-v1.5 (pro) [31] is used instead of GPT-4o, the accuracy drops to 77.4%. On AudioCaps-Eval, MACE outperforms all other metrics across all categories. The slight drop in performance on Clotho-Eval compared to AudioCaps-Eval can be attributed to differences in audio duration (Clotho-Eval contains longer audio clips) and content type (AudioCaps-Eval contains more speech). Notably, MACE demonstrates substantial improvements over the widely used FENSE metric on both datasets: Clotho-Eval (75.7% to 79.0%) and AudioCaps-Eval (85.3% to 88.1%).
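For reference, pair accuracy can be computed as sketched below; the tuple layout and the metric_fn signature are illustrative rather than taken from the benchmark’s released code.

```python
def pair_accuracy(pairs, metric_fn):
    """Fraction of pairs where the metric ranks the human-preferred caption higher.

    pairs: iterable of (caption_a, caption_b, references, audio, human_choice),
    with human_choice = 0 if caption_a was preferred by annotators, else 1.
    metric_fn: callable scoring a single (caption, references, audio) triple.
    """
    correct = 0
    for caption_a, caption_b, references, audio, human_choice in pairs:
        score_a = metric_fn(caption_a, references, audio)
        score_b = metric_fn(caption_b, references, audio)
        predicted = 0 if score_a > score_b else 1   # ties count as caption_b here
        correct += int(predicted == human_choice)
    return correct / len(pairs)
```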
V-B Ablation analysis of MACE’s audio components
In this section, we perform an ablation study to understand the contribution of the different components. Table II shows the performance of MACE_AT (using the fluency score with S_audio-text, the audio-text component of MACE), MACE_TT (using the fluency score with S_text-text, the text-text component of MACE), and the final MACE metric on the Clotho-Eval and AudioCaps-Eval datasets. Our ablation study reveals a consistent trend in overall human-preference match accuracy on both datasets: MACE > MACE_TT > MACE_AT. The performance gap between MACE_AT and MACE_TT can be attributed to two factors. First, the CLAP audio encoder is trained on 7-second audio inputs, while the datasets contain 10-30 second audio clips, which is not optimal for CLAP and leads to information loss. Second, MACE_TT benefits from averaging similarities across five caption-reference pairs, potentially offering more robust evaluations. MACE’s improvement over its individual components indicates that combining audio-text and text-text similarities captures complementary aspects of caption quality, aligning well with human preferences.
Test Data | Metric | HC | HI | HM | MM | All |
Clotho | FENSE | 60.5 | 94.7 | 80.2 | 72.8 | 75.7 |
Clotho | MACE_AT | 61.4 | 97.1 | 80.2 | 74.0 | 76.8
Clotho | MACE_TT | 60.5 | 97.1 | 79.3 | 75.9 | 77.7
Clotho | MACE | 63.3 | 98.0 | 80.6 | 77.0 | 79.0 |
AudioCaps | FENSE | 64.5 | 98.4 | 91.6 | 84.6 | 85.3 |
AudioCaps | MACE_AT | 68.5 | 98.4 | 91.2 | 78.7 | 82.6
AudioCaps | MACE_TT | 66.5 | 99.6 | 92.5 | 85.6 | 86.4
AudioCaps | MACE | 74.4 | 99.2 | 94.6 | 86.3 | 88.1 |
V-C MACE as objective metric
MACE can also be used as an objective metric [32] that does not require a ground-truth or reference caption for computation. To achieve this, we compute MACE but skip the text-text score S_text-text in Fig. 1. This is equivalent to the MACE_AT score in Section V-B and Table II. Compared with FENSE, which requires reference captions, MACE_AT uses no references yet is comparable to or only slightly underperforms FENSE. Specifically, MACE_AT outperforms FENSE by 1.45% on Clotho-Eval and underperforms it by 3.16% on AudioCaps-Eval. This makes MACE ideal for large-scale audio captioning evaluations [33, 34, 35] where annotations are either unavailable, costly, or restricted due to privacy concerns.
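A reference-free sketch, under the same assumptions as the earlier MACE sketch (pre-computed, normalized CLAP embeddings and an external fluency error probability): only the audio-text term of Eq. (1) is kept and weighted by the fluency penalty.

```python
import numpy as np

def mace_at_score(audio_emb, cand_emb, fluency_error_prob,
                  threshold=0.97, alpha=0.3):
    """Reference-free MACE_AT: audio-text similarity with the fluency penalty."""
    s_audio_text = float(np.dot(audio_emb, cand_emb))  # Eq. (1), normalized embeddings
    fp = 1.0 if fluency_error_prob > threshold else 0.0
    return (1.0 - alpha * fp) * s_audio_text
```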
V-D Ablation analysis of fluency detection
To optimize the fluency penalty parameters, we conduct an analysis using a 20% validation set from Clotho. We systematically varied the threshold from 0.90 to 1.00 in increments of 0.01, and the penalty coefficient from 0.1 to 1.0 in increments of 0.1. The results of this analysis are presented in Figure 2.
Our findings indicate that the optimal threshold for error detection should be set significantly higher (0.97) than the default of 0.9, implying that the BERT model requires greater confidence to accurately identify fluency errors in audio captioning. Additionally, the penalty coefficient should be much lower (0.3) than the default of 0.9. This need for a higher detection threshold and reduced penalty suggests that the BERT model’s predictions do not directly align with human judgments of caption quality for the task of automated audio captioning. This can stem from the nature of AAC, where minor grammatical errors are less impactful on the overall quality of the generated caption.
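This parameter sweep can be reproduced with a simple grid search of the kind sketched below. The val_pairs collection, the scorer wrapper parameterized by (threshold, alpha), and the pair_accuracy helper are assumed to be available, e.g. built from the sketches given earlier; none of them are part of the released code.

```python
import numpy as np

def tune_fluency_params(val_pairs, scorer):
    """Grid search over the fluency-detection threshold and penalty coefficient.

    scorer(caption, references, audio, threshold, alpha) -> MACE score
    (an assumed wrapper around the MACE sketch above); pair_accuracy is the
    helper sketched in Section V-A.
    """
    best_threshold, best_alpha, best_acc = None, None, -1.0
    for threshold in np.arange(0.90, 1.001, 0.01):
        for alpha in np.arange(0.1, 1.01, 0.1):
            metric_fn = lambda c, r, a: scorer(c, r, a, threshold=threshold, alpha=alpha)
            acc = pair_accuracy(val_pairs, metric_fn)
            if acc > best_acc:
                best_threshold, best_alpha, best_acc = threshold, alpha, acc
    return best_threshold, best_alpha, best_acc
```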
Reference: Yelling and then siren and horn | H | sBERT | CLAP_AT | CLAP_TT
A large engine passes as people speak followed by a siren. | ✓ | 0.50 | 0.74 | 0.36 |
High pitched vibrations and humming of a power tool with some rustling. | ✗ | 0.54 | 0.38 | 0.03 |
Reference: someone running across a field made of dirt. | H | sBERT | CLAP_AT | CLAP_TT
Footsteps crunch on a gravel path at a steady pace. | ✓ | 0.34 | 0.77 | 0.66 |
A person is walking on a gravel path. | ✗ | 0.51 | 0.75 | 0.58 |
V-E Qualitative Evaluation
One of MACE’s contributions is the integration of audio information through CLAP, rather than relying solely on traditional text-based embeddings. We perform a qualitative comparison of CLAP embedding similarity with the commonly used Sentence-BERT embedding similarity. Table III provides representative examples from AudioCaps-Eval and Clotho-Eval, showing a reference caption, a pair of candidate captions, and the similarity scores from Sentence-BERT, the CLAP audio-text similarity (CLAP_AT), and the CLAP text-text similarity (CLAP_TT). The H column indicates the human-preferred caption of the two. We see that CLAP_AT and CLAP_TT assign higher similarity to the human-preferred caption (marked by ✓) than to the rejected one, whereas Sentence-BERT does not, owing to CLAP’s ability to encode the underlying audio context.
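The contrast in Table III can be reproduced along the following lines. The sentence-transformers usage is standard, but the model checkpoint name is illustrative, and clap_text_embedding is a placeholder for a CLAP text-encoder call that returns a normalized embedding.

```python
from sentence_transformers import SentenceTransformer, util

reference = "someone running across a field made of dirt."
candidates = [
    "Footsteps crunch on a gravel path at a steady pace.",  # human-preferred
    "A person is walking on a gravel path.",
]

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative sentence-BERT checkpoint
ref_emb = sbert.encode(reference, convert_to_tensor=True)

for cand in candidates:
    cand_emb = sbert.encode(cand, convert_to_tensor=True)
    s_sbert = util.cos_sim(cand_emb, ref_emb).item()
    # clap_text_embedding: placeholder returning a normalized CLAP text embedding,
    # so the dot product below corresponds to the text-text score of Eq. (2)
    s_tt = float(clap_text_embedding(cand) @ clap_text_embedding(reference))
    print(f"sBERT={s_sbert:.2f}  CLAP_TT={s_tt:.2f}  | {cand}")
```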
VI Conclusion
We present MACE, a novel metric for evaluating audio captions that integrates both audio and textual information, effectively addressing key limitations in current evaluation methods. By combining audio-text and text-text correspondences with an enhanced fluency penalty, MACE aligns more closely with human perceptions and preferences. Experimental results on standard benchmarks show that MACE outperforms existing metrics in correlation with human judgments, achieving a 3.28% and 4.36% relative accuracy improvement over the FENSE metric and significantly surpassing traditional metrics used in the Automated Audio Captioning task.
References
- [1] X. Mei, X. Liu, M. D. Plumbley, and W. Wang, “Automated audio captioning: An overview of recent progress and new challenges,” EURASIP journal on audio, speech, and music processing, vol. 2022, no. 1, p. 26, 2022.
- [2] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.
- [3] D. Barchiesi, D. Giannoulis, D. Stowell, and M. D. Plumbley, “Acoustic scene classification: Classifying environments from the sounds they produce,” IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 16–34, 2015.
- [4] S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” Advances in Neural Information Processing Systems, vol. 36, pp. 18 090–18 108, 2023.
- [5] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
- [6] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013
- [7] S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, J. Goldstein, A. Lavie, C.-Y. Lin, and C. Voss, Eds. Ann Arbor, Michigan: Association for Computational Linguistics, Jun. 2005, pp. 65–72. [Online]. Available: https://aclanthology.org/W05-0909
- [8] S. Deshmukh, B. Elizalde, D. Emmanouilidou, B. Raj, R. Singh, and H. Wang, “Training audio captioning models without audio,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 371–375.
- [9] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Spice: Semantic propositional image caption evaluation,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14. Springer, 2016, pp. 382–398.
- [10] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, “Can audio captions be evaluated with image caption metrics?” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 981–985.
- [11] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using siamese BERT-networks,” arXiv preprint arXiv:1908.10084, 2019.
- [12] F. Gontier, R. Serizel, and C. Cerisara, “Spice+: Evaluation of automatic audio captioning systems with pre-trained language models,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [13] Q. Wang, J.-C. Gu, and Z.-H. Ling, “X-ace: Explainable and multi-factor audio captioning evaluation,” in Findings of the Association for Computational Linguistics ACL 2024, 2024, pp. 12 273–12 287.
- [14] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [15] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [16] B. Elizalde, S. Deshmukh, and H. Wang, “Natural language supervision for general-purpose audio representations,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 336–340.
- [17] A. K. Sridhar, Y. Guo, E. Visser, and R. Mahfuz, “Parameter efficient audio captioning with faithful guidance using audio-text shared latent representation,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 1181–1185.
- [18] S. Kothinti and D. Emmanouilidou, “Investigations in audio captioning: Addressing vocabulary imbalance and evaluating suitability of language-centric performance metrics,” 2023. [Online]. Available: https://arxiv.org/abs/2211.06547
- [19] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.
- [20] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved image captioning via policy gradient optimization of spider,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 873–881.
- [21] T. Zhang*, V. Kishore*, F. Wu*, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=SkeHuCVFDr
- [22] T. Sellam, D. Das, and A. Parikh, “BLEURT: Learning robust metrics for text generation,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 7881–7892. [Online]. Available: https://aclanthology.org/2020.acl-main.704
- [23] G. Wijngaard, E. Formisano, B. L. Giordano, and M. Dumontier, “Aces: Evaluating automated audio captioning models on the semantics of sounds,” in 2023 31st European Signal Processing Conference (EUSIPCO). IEEE, 2023, pp. 770–774.
- [24] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
- [25] T.-H. Wu, J. E. Gonzalez, T. Darrell, and D. M. Chan, “Clair-a: Leveraging large language models to judge audio captions,” arXiv preprint arXiv:2409.12962, 2024.
- [26] T. B. Brown et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
- [27] C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 119–132.
- [28] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An audio captioning dataset,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 736–740.
- [29] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 646–650.
- [30] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
- [31] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
- [32] S. Deshmukh, D. Alharthi, B. Elizalde, H. Gamper, M. Al Ismail, R. Singh, B. Raj, and H. Wang, “Pam: Prompting audio-language models for audio quality assessment,” in Interspeech 2024, 2024, pp. 3320–3324.
- [33] X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y. Zou, and W. Wang, “Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
- [34] L. Sun, X. Xu, M. Wu, and W. Xie, “Auto-acd: A large-scale dataset for audio-language representation learning,” in Proceedings of the 32nd ACM International Conference on Multimedia, ser. MM ’24. New York, NY, USA: Association for Computing Machinery, 2024, p. 5025–5034. [Online]. Available: https://doi.org/10.1145/3664647.3681472
- [35] S. Dixit, L. Heller, and C. Donahue, “Vision language models are few-shot audio spectrogram classifiers,” in Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation, 2024. [Online]. Available: https://openreview.net/forum?id=RnBAclRKOC