
Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

Kaiyi Luo1,2†, Xulong Zhang1†, Jianzong Wang1∗, Huaxiong Li2, Ning Cheng1, Jing Xiao1
†Both authors have made equal contributions. ∗Corresponding author: Jianzong Wang ([email protected])
1Ping An Technology (Shenzhen) Co., Ltd.
2Department of Control Science and Intelligent Engineering, Nanjing University
Abstract

Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, poses a great challenge due to the difficulty of uncovering discriminative features from audio clips and texts. Existing studies are restricted in the following two ways: 1) Most researchers utilize contrastive learning to construct a common subspace where similarities among data can be measured. However, they consider only cross-modal transformation, neglecting intra-modal separability. Besides, the temperature parameter is not adaptively adjusted with semantic guidance, which degrades performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment. This paper introduces a novel audio-text oriented CMR approach, termed Contrastive Latent Space Reconstruction Learning (CLSR). CLSR improves contrastive representation learning by taking intra-modal separability into account and adopting an adaptive temperature control strategy. Moreover, latent representation reconstruction modules are embedded into the CMR framework, which improves modal interaction. Experiments comparing CLSR with some state-of-the-art methods on two audio-text datasets have validated its superiority.

Index Terms:
Cross-modal Retrieval, Data Reconstruction, Contrastive Learning

I Introduction

The explosion of multi-modal data has sparked significant interest in cross-modal retrieval, which aims to conduct similarity searches across different modalities [1, 2, 3]. For example, in a search engine, one expects relevant results such as videos and pictures from text inputs. However, the heterogeneous data structures of different modalities pose a great challenge to similarity measurement [4]. The fundamental goal of cross-modal retrieval is to establish correlations between diverse forms of data, such as images, texts, audio, and videos. A frequently used approach involves creating a shared latent space in which instances with similar semantics from various modalities are positioned close to one another, enabling the system to perform efficient retrieval tasks [5]. Given a textual query, a cross-modal retrieval system can transform the textual feature into the common space and identify the corresponding image by computing distances among data.

In recent years, numerous CMR methods have been proposed [6, 7, 8, 9, 10], which fall into two categories: optimization-based methods and deep methods. Optimization-based methods such as [9] reframe the challenge of CMR as an optimization task, which can be solved efficiently with different optimization schemes such as the Alternating Direction Method of Multipliers (ADMM). Deep learning based methods extract robust representations with neural networks, which contain less noise and more discriminative features. Generally, deep methods outperform shallow ones due to the powerful representation capacity of neural networks. Though the existing CMR methods can achieve satisfying performance, these studies focus on text-image retrieval and text-video retrieval, without consideration for audio-text retrieval, which is more challenging because audio information may contain noise and multiple sources. To our knowledge, there are several works dedicated to audio-text retrieval [11, 12, 13, 14]. The basic approach is to apply different network structures and contrastive learning to unify features from the two modalities. Despite the good performance produced by these methods, there are several problems to be solved: 1) Most researchers utilize contrastive learning to construct a common subspace where similarities among data can be measured. However, they consider only cross-modal transformation, neglecting intra-modal separability. Besides, the temperature parameter is not adaptively adjusted with semantic guidance, which degrades performance. 2) These methods do not take latent representation reconstruction into account, which impairs their performance.

To tackle the aforementioned challenges, this paper introduces a novel approach for audio-text retrieval, termed Contrastive Latent Space Reconstruction (CLSR). For data processing, raw audio data are transformed into log Mel-spectrograms, which depict fine-grained acoustic features and can be processed like images. Textual data are converted into word embeddings with the BERT model [15], which capture high-level correlations between words. To construct the shared space, CLSR extends the existing NT-Xent contrastive loss [16] to increase feature discrimination and adopts adaptive temperature control, which increases positive compactness and negative separability. Moreover, a latent reconstruction module is designed for each modality for better semantic alignment. The main contributions of CLSR can be summarized as follows:

  • We introduce a novel CMR method, i.e., CLSR, which provides a new perspective for audio-text CMR problems.

  • Considering the property of audio-text datasets, we adopt a temperature-adaptive contrastive method, which enhances the discrimination of latent representations based on semantic alignment.

  • Experimental results across multiple datasets illustrate that CLSR surpasses certain state-of-the-art cross-modal methods.

II Related Work

II-A Audio-Text Retrieval

Despite the effectiveness of the previous methods, audio-text retrieval is less studied due to the difficulty of discovering discriminative representations for audio clips. Chechik et al. [11] proposed a scalable machine learning method with the passive-aggressive model (PAMIR). Ikawa et al. [12] construct a latent variable space via onomatopoeia generation. Elizalde et al. [13] utilize a siamese network to jointly map audio and text features into a shared space. However, these methods are restricted by the form of queries. Oncescu et al. [14] exploit Mixture-of-Embedded Experts (MoEE) [17] and Collaborative Experts (CE) [18] for audio-text retrieval involving natural language descriptions. Lou et al. [19] generate audio-text embeddings utilizing pre-trained audio neural networks (PANNs) [20] and employ NetRVLAD pooling [21]. Mei et al. [22] incorporate contrastive learning to increase feature discrimination.

II-B Contrastive learning

Contrastive learning is an unsupervised paradigm designed to enhance the quality of representations by promoting the closeness of positive pairs and the distinctiveness of negative ones. InfoNCE [23] is proposed to maximize a lower bound on mutual information. NT-Xent [16] promotes multi-modal representation learning by building two symmetric contrastive losses. BYOL [24] achieves promising results without negative pairs. Decoupled Contrastive Learning (DCL) [25] improves effectiveness by removing the positive pairs from the denominator of the contrastive loss. Recently, Huang et al. [26] proposed Model-Aware Contrastive Learning (MACL) to resolve the uniformity-tolerance dilemma and gradient reduction. These contrastive methods have been proven effective experimentally.

III Proposed Method

III-A Notations

In this paper, the multimodal dataset is written as $\mathcal{D}=\{\mathrm{A},\mathrm{T}\}$, where $\mathrm{A}=[a_{1},a_{2},\dots,a_{m}]$ and $\mathrm{T}=[t_{1},t_{2},\dots,t_{m}]$ represent the audio modality and the text modality, respectively. A data batch is denoted as $O=[o_{1},o_{2},\dots,o_{b}]$, where $b$ is the batch size and $o_{j}=[a_{j},t_{j}]$ is the $j$th audio-text pair. The relevance between two instances is measured by cosine similarity:

\cos(x,y)=\frac{x^{\mathrm{T}}y}{\|x\|_{2}\|y\|_{2}}, \qquad (1)

where $x$ and $y$ are two vectors.
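Since Eq. (1) is also applied batch-wise later in the paper (e.g., $\cos(\mathrm{Z}^{a},\mathrm{Z}^{t})$ denotes the matrix of pairwise similarities), the sketch below shows that computation; the use of PyTorch is our assumption, as the paper does not name a framework.

```python
import torch
import torch.nn.functional as F

def cosine_sim_matrix(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Eq. (1) applied to every row pair: entry (i, j) = cos(x_i, y_j)."""
    x = F.normalize(x, p=2, dim=-1)   # divide each row by its L2 norm
    y = F.normalize(y, p=2, dim=-1)
    return x @ y.t()                  # (b, b) similarity matrix
```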

III-B Model Formulation

The proposed CLSR framework is depicted in Fig. 1, which comprises three parts: the feature extraction module, the confidence-aware contrastive module and the modality reconstruction module.

III-B1 Feature Extraction Module

Deep neural networks are prevalent owing to their strong capacity to capture high-level semantic features. For audio data, the Mel-frequency spectrogram is widely exploited in speech synthesis (TTS) and voice conversion [27, 28, 29, 30]. It is robust to noise and signal variations, since it decomposes the audio signal into several windows within which the signal is approximately stationary, alleviating the disturbance of noise and signal variations. We apply a convolutional network to generate high-level audio features, which is similar to most existing works involving the image modality. To extract semantic features from texts, some literature directly uses bag-of-words vectors and applies MLPs for feature extraction, which loses underlying semantic information. Recently, researchers have proposed textual transformers [15] to uncover the correlations between words, thus creating more robust semantic-aware features. Inspired by this, we employ the BERT model for textual feature extraction. Therefore, modality-specific features can be obtained:

\begin{aligned} &\mathrm{F}^{a}=\mathrm{E}_{a}\left(\mathrm{A},\theta_{a}\right)\in\mathbb{R}^{b\times d_{a}}\\ &\mathrm{F}^{t}=\mathrm{E}_{t}\left(\mathrm{T},\theta_{t}\right)\in\mathbb{R}^{b\times d_{t}},\end{aligned}\qquad(2)
Figure 1: The overall workflow of CLSR.

where $\theta_{a}$ and $\theta_{t}$ denote the network parameters of the two feature extractors, respectively. In this way, semantic correlations can be explored based on the representations.
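As an illustration of Eq. (2), the following sketch pairs a small convolutional encoder for log Mel-spectrograms with a BERT text encoder. The exact CNN architecture, the use of the [CLS] token, and the HuggingFace checkpoint name are our assumptions; the paper only specifies a convolutional audio branch and a BERT text branch.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class AudioCNN(nn.Module):
    """Illustrative convolutional encoder over log Mel-spectrograms of shape (b, 1, frames, 64)."""
    def __init__(self, d_a: int = 2048):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling -> (b, 128, 1, 1)
        )
        self.fc = nn.Linear(128, d_a)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(mel).flatten(1))       # F^a in R^{b x d_a}

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def text_features(captions: list[str]) -> torch.Tensor:
    """F^t in R^{b x d_t}: here the [CLS] embedding of each caption (d_t = 768)."""
    tokens = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    return bert(**tokens).last_hidden_state[:, 0]
```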

III-B2 Confidence-aware Contrastive Loss

With the latent representations extracted from the feature extraction modules, we design embedding heads for both modalities to align the modality dimensions. Through the embedding heads, audio and textual features are forced into a shared latent space to facilitate semantic consistency. The extracted features are fed into the embedding heads to generate continuous-valued representations. Both embedding heads consist of fully connected layers with ReLU activation functions. The process can be described as follows:

\begin{aligned} &\mathrm{Z}^{a}=\mathrm{E}_{a}\left(\mathrm{F}^{a},\epsilon_{a}\right)\in\mathbb{R}^{b\times c}\\ &\mathrm{Z}^{t}=\mathrm{E}_{t}\left(\mathrm{F}^{t},\epsilon_{t}\right)\in\mathbb{R}^{b\times c},\end{aligned}\qquad(3)

where $c$ is the dimension of the shared latent space.
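A minimal sketch of one embedding head follows; the depth (two linear layers) is our assumption, as the paper only states fully connected layers with ReLU activations. The default $c=1024$ mirrors the output embedding dimension reported in Sec. IV-B.

```python
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Maps modality-specific features (b, d_in) into the shared c-dimensional space, Eq. (3)."""
    def __init__(self, d_in: int, c: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, c), nn.ReLU(), nn.Linear(c, c))

    def forward(self, f):
        return self.net(f)    # Z^a or Z^t, shape (b, c)
```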

In the absence of annotated labels, semantic alignment can be achieved through the data co-occurrence relationships across modalities. Contrastive learning, as a type of unsupervised learning, aims to learn feature representations that discriminate between data samples by comparing their similarity. Specifically, within a mini-batch, contrastive learning strives to bring data samples of positive pairs closer while simultaneously pushing apart those of negative pairs, and the audio-to-text loss is defined as follows:

\mathcal{L}_{a2t}=-\frac{1}{b}\sum_{i=1}^{b}\log\frac{\exp\left(a_{i}t_{i}^{\mathrm{T}}/\tau\right)}{\exp\left(a_{i}t_{i}^{\mathrm{T}}/\tau\right)+\sum_{j=1}^{K}\exp\left(a_{i}t_{j}^{\mathrm{T}}/\tau\right)},\qquad(4)

where $\{a_{i},t_{i}\}$ is a positive pair, and $K$ is the number of negative pairs for the $i$th instance. The NT-Xent loss [16, 22], which is widely used in multi-modal representation learning, can be written as follows:

\begin{aligned}\mathcal{L}_{NT\text{-}XENT}&=\mathcal{L}_{a2t}+\mathcal{L}_{t2a}\\&=-\frac{1}{b}\sum_{i=1}^{b}\log\frac{\exp\left(a_{i}t_{i}^{\mathrm{T}}/\tau\right)}{\exp\left(a_{i}t_{i}^{\mathrm{T}}/\tau\right)+\sum_{j=1}^{K}\exp\left(a_{i}t_{j}^{\mathrm{T}}/\tau\right)}\\&\quad-\frac{1}{b}\sum_{i=1}^{b}\log\frac{\exp\left(t_{i}a_{i}^{\mathrm{T}}/\tau\right)}{\exp\left(t_{i}a_{i}^{\mathrm{T}}/\tau\right)+\sum_{j=1}^{K}\exp\left(t_{i}a_{j}^{\mathrm{T}}/\tau\right)}.\end{aligned}\qquad(5)
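A sketch of Eqs. (4)-(5) with in-batch negatives is given below, assuming the embeddings are L2-normalized so the dot products $a_{i}t_{j}^{\mathrm{T}}$ reduce to cosine similarities; treating the other $b-1$ items of the mini-batch as negatives (i.e., $K=b-1$) is also our assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, candidates: torch.Tensor, tau: float) -> torch.Tensor:
    """One direction of Eq. (4): row i of `anchors` is positive with row i of
    `candidates`; the remaining rows of `candidates` serve as negatives."""
    a = F.normalize(anchors, dim=-1)
    c = F.normalize(candidates, dim=-1)
    logits = a @ c.t() / tau                                   # (b, b) scaled similarities
    targets = torch.arange(a.size(0), device=a.device)         # positives sit on the diagonal
    return F.cross_entropy(logits, targets)                    # averaged over the batch

def nt_xent(z_a: torch.Tensor, z_t: torch.Tensor, tau: float) -> torch.Tensor:
    return info_nce(z_a, z_t, tau) + info_nce(z_t, z_a, tau)   # Eq. (5): L_a2t + L_t2a
```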

However, this form of contrastive learning takes only inter-modal transformation into consideration, neglecting intra-modal instance separability, which limits the performance of representation learning. CLSR therefore expands Eq. (5) into the following form:

\mathcal{L}_{con}=\mathcal{L}_{a2t}+\mathcal{L}_{t2a}+\mathcal{L}_{a2a}+\mathcal{L}_{t2t},\qquad(6)

where $\mathcal{L}_{a2a}$ and $\mathcal{L}_{t2t}$ are contrastive losses applied within the audio modality and the text modality, respectively. Thus, inter-modal and intra-modal contrastive losses are constructed, which further increases multi-modal semantic consistency.
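Under the same in-batch sketch, Eq. (6) adds intra-modal terms. One natural reading, which we assume here, is that each sample acts as its own positive within its modality while the other in-batch samples act as negatives, which is what enforces intra-modal separability.

```python
def clsr_contrastive(z_a: torch.Tensor, z_t: torch.Tensor, tau: float) -> torch.Tensor:
    """Eq. (6): inter-modal plus intra-modal terms (reusing info_nce from the sketch above)."""
    return (info_nce(z_a, z_t, tau) + info_nce(z_t, z_a, tau)     # L_a2t + L_t2a
            + info_nce(z_a, z_a, tau) + info_nce(z_t, z_t, tau))  # L_a2a + L_t2t
```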

Existing works show that the temperature parameter is an important factor for weighting hard negative samples, yet it is often set empirically and intuitively. Inspired by [26], we propose a confidence-aware temperature scheme that controls the temperature according to the semantic alignment:

\tau=\tau_{0}\cdot\gamma^{\operatorname{Tr}\left(\cos\left(\mathrm{Z}^{a},\mathrm{Z}^{t}\right)\right)/b},\qquad(7)

where $\tau_{0}$ is the initial temperature, $\gamma>0$ is the scaling factor, and $\operatorname{Tr}\left(\cos\left(\mathrm{Z}^{a},\mathrm{Z}^{t}\right)\right)$ denotes the semantic alignment confidence. Ideally, representations from multi-modal sources are expected to be identical so as to eliminate the heterogeneous gap among modalities, i.e., $\operatorname{Tr}\left(\cos\left(\mathrm{Z}^{a},\mathrm{Z}^{t}\right)\right)/b=1$. At the beginning of training, a higher temperature is set to impose higher penalties on implicit hard negative samples for better feature consistency. As the iterations proceed, the semantic alignment improves, and a lower temperature is employed to uncover potential positive pairs.
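A direct transcription of Eq. (7) is sketched below; treating the confidence as a detached scalar (so gradients do not flow through the temperature) is our assumption, and the defaults mirror the values of $\tau_{0}$ and $\gamma$ reported in Sec. IV-B.

```python
import torch
import torch.nn.functional as F

def adaptive_tau(z_a: torch.Tensor, z_t: torch.Tensor,
                 tau0: float = 0.07, gamma: float = 1.2) -> float:
    """Eq. (7): tau = tau0 * gamma ** (Tr(cos(Z^a, Z^t)) / b)."""
    za = F.normalize(z_a, dim=-1)
    zt = F.normalize(z_t, dim=-1)
    confidence = (za * zt).sum(dim=-1).mean()   # diagonal of cos(Z^a, Z^t), averaged = Tr(.)/b
    return tau0 * gamma ** confidence.detach().item()
```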

Algorithm 1: Optimization for CLSR
Input: Training data $\mathcal{D}=\{\mathrm{A},\mathrm{T}\}$, code length $c$, batch size $b$, hyperparameters $\{\alpha,\beta\}$, maximum epoch $T$.
Output: Network parameters $\{\theta_{k},\epsilon_{k},\gamma_{k}\}$, $k\in\{a,t\}$, for both modalities.
1: Initialize the feature extraction modules $\theta_{a}$ and $\theta_{t}$ with pretrained models, and the rest randomly;
2: for $t=1:T$ do
3:   Randomly sample $O=[o_{1},o_{2},\dots,o_{b}]$ from $\mathcal{D}$;
4:   Extract features $\mathrm{F}^{a}=\mathrm{E}_{a}(\mathrm{A},\theta_{a})$ and $\mathrm{F}^{t}=\mathrm{E}_{t}(\mathrm{T},\theta_{t})$;
5:   Encode $\mathrm{Z}^{k}=\mathrm{E}_{k}(\mathrm{F}^{k},\epsilon_{k})$, $k\in\{a,t\}$;
6:   Decode $\mathrm{H}^{k}=\mathrm{D}_{k}(\mathrm{Z}^{k},\gamma_{k})$, $k\in\{a,t\}$;
7:   Compute the loss according to Eq. (11);
8:   Update $\{\theta_{k},\epsilon_{k},\gamma_{k}\}$, $k\in\{a,t\}$, with SGD.
9: end for
10: return Network parameters $\{\theta_{k},\epsilon_{k},\gamma_{k}\}$, $k\in\{a,t\}$.

III-B3 Semantic Consistency Loss

Cross-modal retrieval aims to bridge the heterogeneous disparity among different modalities, necessitating the establishment of symmetric similarities, i.e., $S_{ij}=S_{ji}$. This correlation can be easily satisfied in the supervised setting by leveraging label information, while it needs to be explicitly constructed in the unsupervised scenario. Thus, the following loss function is designed to maintain semantic consistency:

\mathcal{L}_{sem}=\|\cos(\mathrm{Z}^{a},\mathrm{Z}^{t})-\cos(\mathrm{Z}^{t},\mathrm{Z}^{a})\|^{2}_{F},\qquad(8)

where $\|\cdot\|_{F}$ is the Frobenius norm.
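A sketch of Eq. (8) follows; since $\cos(\mathrm{Z}^{t},\mathrm{Z}^{a})$ is the transpose of $\cos(\mathrm{Z}^{a},\mathrm{Z}^{t})$, the loss penalizes the asymmetry of the cross-modal similarity matrix.

```python
import torch
import torch.nn.functional as F

def semantic_consistency(z_a: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    """Eq. (8): squared Frobenius norm of S - S^T, where S_ij = cos(z^a_i, z^t_j)."""
    s = F.normalize(z_a, dim=-1) @ F.normalize(z_t, dim=-1).t()   # (b, b)
    return ((s - s.t()) ** 2).sum()
```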

III-B4 Reconstruction Loss

Prior work [31] has demonstrated that deep feature reconstruction can enhance cross-modal correlation and reduce the modality gap. Thus, we introduce decoders for both modalities, as illustrated in Fig. 1. Following the extraction of the high-level features $\mathrm{Z}^{a}$ and $\mathrm{Z}^{t}$, we feed these two modality-specific features into the decoders, which can be written as follows:

\begin{aligned} &\mathrm{H}^{t}=\mathrm{D}_{a}\left(\mathrm{Z}^{a},\gamma_{a}\right)\in\mathbb{R}^{b\times d_{t}}\\ &\mathrm{H}^{a}=\mathrm{D}_{t}\left(\mathrm{Z}^{t},\gamma_{t}\right)\in\mathbb{R}^{b\times d_{a}}.\end{aligned}\qquad(9)

Thus, the reconstruction loss of transforming one modality to the other can be measured by:

\mathcal{L}_{rec}=\|\mathrm{F}^{t}-\mathrm{H}^{t}\|_{F}^{2}+\|\mathrm{F}^{a}-\mathrm{H}^{a}\|_{F}^{2}.\qquad(10)

Therefore, the shared representations are encouraged to retain the information of both modalities, allowing the semantic connections across modalities to be explored to the fullest extent.
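A sketch of Eqs. (9)-(10) is given below; the decoder architecture (a two-layer MLP) is our assumption. Note the cross-modal wiring: the decoder attached to the audio embedding reconstructs the text feature and vice versa.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Maps a shared-space embedding (b, c) back to a modality feature space (b, d_out), Eq. (9)."""
    def __init__(self, c: int, d_out: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(c, c), nn.ReLU(), nn.Linear(c, d_out))

    def forward(self, z):
        return self.net(z)

def reconstruction_loss(f_a, f_t, h_a, h_t) -> torch.Tensor:
    """Eq. (10): H^t = D_a(Z^a) should match F^t, and H^a = D_t(Z^t) should match F^a."""
    return ((f_t - h_t) ** 2).sum() + ((f_a - h_a) ** 2).sum()
```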

III-B5 Optimization

The optimization goal can be written as follows:

\begin{aligned}&\min\;\mathcal{L}=\mathcal{L}_{con}+\alpha\mathcal{L}_{sem}+\beta\mathcal{L}_{rec}\\&\text{ s.t. }\;\tau=\tau_{0}\cdot\gamma^{\operatorname{Tr}\left(\cos\left(\mathrm{Z}^{a},\mathrm{Z}^{t}\right)\right)/b}.\end{aligned}\qquad(11)

$\alpha$ and $\beta$ are two tunable parameters. The optimization is conducted with SGD. The overall training steps are detailed in Algorithm 1.
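Tying the sketches above together, one CLSR training step under Eq. (11) could look as follows; the module instances (`audio_encoder`, `head_a`, `head_t`, `dec_a`, `dec_t`) are assumed to be built from the earlier sketches, and the defaults $\alpha=1$, $\beta=0.1$ follow Sec. IV-B.

```python
def training_step(mel, captions, optimizer, alpha: float = 1.0, beta: float = 0.1):
    f_a = audio_encoder(mel)                      # Eq. (2), audio branch
    f_t = text_features(captions)                 # Eq. (2), text branch
    z_a, z_t = head_a(f_a), head_t(f_t)           # Eq. (3), shared space
    tau = adaptive_tau(z_a, z_t)                  # Eq. (7), confidence-aware temperature
    loss = (clsr_contrastive(z_a, z_t, tau)                        # Eq. (6)
            + alpha * semantic_consistency(z_a, z_t)               # Eq. (8)
            + beta * reconstruction_loss(f_a, f_t,
                                         dec_t(z_t), dec_a(z_a)))  # Eqs. (9)-(10)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```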

IV Experiment

TABLE I: The R@k results of CLSR and baselines on AudioCaps and Clotho.

| Task | Method | AudioCaps R@1 | AudioCaps R@5 | AudioCaps R@10 | Clotho R@1 | Clotho R@5 | Clotho R@10 |
| A2T  | MOEE   | 26.6 | 59.3 | 73.5 | 7.2  | 22.1 | 33.2 |
| A2T  | CE     | 27.6 | 60.5 | 74.7 | 7.0  | 22.7 | 34.6 |
| A2T  | ARC    | 33.3 | 65.3 | 80.6 | 13.0 | 30.5 | 45.4 |
| A2T  | ASE    | 38.8 | 71.5 | 83.1 | 13.4 | 36.1 | 49.3 |
| A2T  | CLSR   | 42.2 | 73.3 | 84.5 | 15.3 | 36.3 | 49.6 |
| T2A  | MOEE   | 23.0 | 55.7 | 71.0 | 6.0  | 20.8 | 32.3 |
| T2A  | CE     | 23.6 | 56.2 | 71.4 | 6.7  | 21.6 | 33.2 |
| T2A  | ARC    | 29.3 | 60.2 | 79.3 | 13.1 | 28.2 | 45.1 |
| T2A  | ASE    | 33.4 | 69.1 | 81.7 | 13.2 | 35.8 | 49.6 |
| T2A  | CLSR   | 34.1 | 70.0 | 83.7 | 13.4 | 36.2 | 50.3 |

IV-A Datasets

The effectiveness of CLSR is verified on two standard datasets, i.e., AudioCaps [32] and Clotho [33].

AudioCaps comprises approximately 50,000 audio clips, each lasting for 10 seconds, sourced from AudioSet [34]. Among these, 49,274 audio clips have been chosen for the training set, each accompanied by its respective textual description. Furthermore, the validation set comprises 494 audio clips, while the test set contains 957 audio clips. Each audio clip in both sets is paired with five corresponding captions.

Clotho contains audio clips ranging from 15 to 30 seconds, annotated by corresponding textual data. The training set comprises 3,839 audio clips, while both the validation and test sets include 1,045 audio clips each. Additionally, each audio clip is paired with five corresponding textual descriptions.

IV-B Experimental Setups

Following [22], the log Mel-spectrogram is computed with a Hanning window of 1024 points, a hop size of 320 points, and 64 mel bins. The maximum number of training epochs is set to 50, with Adam as the optimizer. The learning rate is set to $10^{-4}$ and decays by a factor of 10 every 20 epochs. The initial temperature $\tau_{0}$ is 0.07, the scaling factor $\gamma$ is set to 1.2, and the dimension of the output embedding is 1024. $\alpha$ and $\beta$ are set to 1 and 0.1, respectively. Experiments are carried out on a single NVIDIA Tesla V100 GPU.
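For reference, these spectrogram settings can be reproduced with, e.g., torchaudio as sketched below; the sample rate (32 kHz) is not restated in the paper and is an assumption here.

```python
import torch
import torchaudio

# Hanning window of 1024 points, hop size 320, 64 mel bins, per Sec. IV-B.
mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=32000,      # assumption: not restated in the paper
    n_fft=1024,
    win_length=1024,
    hop_length=320,
    n_mels=64,
    window_fn=torch.hann_window,
)

def log_mel(waveform: torch.Tensor) -> torch.Tensor:
    """Log Mel-spectrogram of shape (channels, 64, frames)."""
    return torch.log(mel_extractor(waveform) + 1e-10)
```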

The evaluation metric is Recall at rank $k$ (R@$k$), which indicates whether the content most relevant to a query appears within the top $k$ ranks. R@$k$ is scaled between 0 and 1, and higher values signify better performance, meaning the top-ranked retrieval results are more likely to be related to the query.
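A sketch of how R@k can be computed from a query-candidate similarity matrix is shown below; details such as how the five captions per clip are aggregated are not specified here and would need to follow the protocol of [22, 35].

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    """sim[i, j] = similarity of query i to candidate j, with the ground-truth
    match on the diagonal. Returns the fraction of queries whose correct item
    appears among the top-k retrieved candidates."""
    topk = sim.topk(k, dim=1).indices                      # (n_queries, k)
    correct = torch.arange(sim.size(0)).unsqueeze(1)       # ground-truth index per query
    return (topk == correct).any(dim=1).float().mean().item()
```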

IV-C Evaluation

We compare CLSR with several baselines, i.e., MOEE [17], CE [18], ARC [19], and ASE [22]. It should be noted that MOEE and CE are two frameworks originally designed for text-video retrieval; following [35], we adapt them for text-audio retrieval. We evaluate CLSR and these baselines on AudioCaps and Clotho, and the experimental results, including R@1, R@5 and R@10, are detailed in Table I. Based on the experimental results, it can be observed that:

  • CLSR consistently achieves superior R@k results in both Audio-to-Text and Text-to-Audio tasks across both datasets. CLSR enlarges the distance between negative pairs to enhance instance discrimination, contributing to more robust latent embeddings. For contrastive learning, to alleviate the limitation of a small batch size, a reweighting strategy is imposed to adaptively adjust the weights of different positive pairs.

  • In most cases, R@k results on the Audio-to-Text task tend to surpass those on the Text-to-Audio task. This trend may be attributed to the richer semantic information in textual data compared to audio data.

IV-D Ablation Study

For CLSR, multi-modal contrastive learning, semantic consistency, and the reconstruction loss are adopted for CMR representation learning. In this subsection, an ablation study is conducted to analyze their effects. From CLSR, four variants are derived, i.e., CLSR-s, CLSR-t, CLSR-k and CLSR-m. CLSR-s removes the intra-modal contrastive loss, i.e., the third and fourth terms in Eq. (6). CLSR-t drops the confidence-aware temperature control strategy. CLSR-k adopts no semantic consistency loss, i.e., the second term in Eq. (11). CLSR-m discards the reconstruction loss. CLSR's performance is assessed in comparison to these four variants on AudioCaps and Clotho.

TABLE II: The R@k results of CLSR and its four variants on AudioCaps and Clotho.

| Task | Method | AudioCaps R@1 | AudioCaps R@5 | AudioCaps R@10 | Clotho R@1 | Clotho R@5 | Clotho R@10 |
| A2T  | CLSR-s | 39.8 | 71.5 | 83.1 | 13.7 | 35.9 | 49.1 |
| A2T  | CLSR-t | 41.5 | 71.3 | 82.7 | 14.8 | 35.7 | 48.9 |
| A2T  | CLSR-k | 41.9 | 71.7 | 83.4 | 14.6 | 36.1 | 47.1 |
| A2T  | CLSR-m | 41.8 | 72.6 | 84.2 | 14.7 | 35.7 | 48.0 |
| A2T  | CLSR   | 42.2 | 73.3 | 84.5 | 15.3 | 36.3 | 49.6 |
| T2A  | CLSR-s | 33.5 | 69.5 | 81.7 | 13.3 | 35.9 | 48.4 |
| T2A  | CLSR-t | 33.9 | 68.9 | 81.6 | 13.2 | 36.1 | 48.6 |
| T2A  | CLSR-k | 33.6 | 69.2 | 81.9 | 13.1 | 35.9 | 48.6 |
| T2A  | CLSR-m | 33.8 | 69.7 | 82.3 | 13.2 | 35.8 | 48.0 |
| T2A  | CLSR   | 34.1 | 70.0 | 83.7 | 13.4 | 36.2 | 50.3 |

The R@k scores are listed in Table II. It can be seen that CLSR generally surpasses these variants. Concretely, CLSR-s achieves comparatively low R@k results, which demonstrates the effectiveness of the expanded intra-modal contrastive loss for learning robust latent representations. CLSR-t and CLSR-k achieve better results than CLSR-s but are still inferior to CLSR, which demonstrates that the adaptive temperature scheme and semantic consistency contribute to improving the retrieval results. The performance of CLSR-m is satisfactory but still lower than that of CLSR, validating the efficacy of the reconstruction loss.

V Conclusion

In this paper, a novel CMR method, dubbed Contrastive Latent Space Reconstruction (CLSR), is introduced to address audio-text retrieval. CLSR is implemented in an unsupervised fashion, employing contrastive learning to dynamically increase the separation between negative pairs while reducing the gap between positive pairs. We expand the multi-modal contrastive loss by taking intra-modal separability into consideration, which improves the robustness of the learned representations. Besides, the temperature is adapted along with the semantic alignment, allowing CLSR to uncover latent correlations between samples. CLSR further utilizes semantic consistency and modality reconstruction to increase the discrimination of sample features. Experimental results across two datasets illustrate that CLSR surpasses certain state-of-the-art cross-modal methods.

VI Acknowledgement

This paper is supported by the Key Research and Development Program of Guangdong Province under grant No.2021B0101400003. Corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd ([email protected]).

References

  • [1] J. Wei, X. Xu, Y. Yang, Y. Ji, Z. Wang, and H. T. Shen, “Universal weighting metric learning for cross-modal matching,” in Proc. CVPR, 2020, pp. 13005–13014.
  • [2] Y. Xin, D. Yang, and Y. Zou, “Improving text-audio retrieval by text-aware attention pooling and prior matrix revised loss,” in Proc. ICASSP, 2023, pp. 1–5.
  • [3] P. Hu, H. Zhu, J. Lin, D. Peng, Y.-P. Zhao, and X. Peng, “Unsupervised contrastive cross-modal hashing,” IEEE Trans. Pattern Anal. Mach. Intell., 2022, early access, doi:10.1109/TPAMI.2022.3177356.
  • [4] M. Hu, Y. Yang, F. Shen, N. Xie, R. Hong, and H. T. Shen, “Collective reconstructive embeddings for cross-modal hashing,” IEEE Trans. Image Process., vol. 28, no. 6, pp. 2770–2784, 2019.
  • [5] M. Li, S.-L. Huang, and L. Zhang, “A general framework for incomplete cross-modal retrieval with missing labels and missing modalities,” in Proc. ICASSP, 2022, pp. 4763–4767.
  • [6] P. Zhang, Y. Li, Z. Huang, and X. Xu, “Aggregation-based graph convolutional hashing for unsupervised cross-modal retrieval,” IEEE Trans. Multim., vol. 24, pp. 466–479, 2022.
  • [7] Z. Zeng, Y. Sun, and W. Mao, “MCCN: multimodal coordinated clustering network for large-scale cross-modal retrieval,” in Proc. ACM MM, 2021, pp. 5427–5435.
  • [8] G. Mikriukov, M. Ravanbakhsh, and B. Demir, “Unsupervised contrastive hashing for cross-modal retrieval in remote sensing,” in Proc. ICASSP, 2022, pp. 4463–4467.
  • [9] K. Wang, R. He, W. Wang, L. Wang, and T. Tan, “Learning coupled feature spaces for cross-modal matching,” in Proc. CVPR, 2013, pp. 2088–2095.
  • [10] K. Luo, C. Zhang, H. Li, X. Jia, and C. Chen, “Adaptive marginalized semantic hashing for unpaired cross-modal retrieval,” IEEE Trans. Multim., 2023, early access, doi:10.1109/TMM.2023.3245400.
  • [11] G. Chechik, E. Ie, M. Rehn, S. Bengio, and D. Lyon, “Large-scale content-based audio retrieval from text queries,” in Proc. ICMR, 2008, pp. 105–112.
  • [12] S. Ikawa and K. Kashino, “Acoustic event search with an onomatopoeic query: measuring distance between onomatopoeic words and sounds.” in Proc. DCASE, 2018, pp. 59–63.
  • [13] B. Elizalde, S. Zarar, and B. Raj, “Cross modal audio search and retrieval with joint embeddings based on text and audio,” in Proc. ICASSP, 2019, pp. 4095–4099.
  • [14] A. M. Oncescu, A. S. Koepke, J. F. Henriques, Z. Akata, and S. Albanie, “Audio retrieval with natural language queries,” in Proc. Interspeech, 2021.
  • [15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, 2019, pp. 4171–4186.
  • [16] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proc. PMLR, 2020, pp. 1597–1607.
  • [17] A. Miech, I. Laptev, and J. Sivic, “Learning a text-video embedding from incomplete and heterogeneous data,” arXiv preprint arXiv:1804.02516, 2018.
  • [18] Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman, “Use what you have: Video retrieval using representations from collaborative experts,” in Proc. BMVC, 2019.
  • [19] S. Lou, X. Xu, M. Wu, and K. Yu, “Audio-text retrieval in context,” in Proc. ICASSP, 2022, pp. 4793–4797.
  • [20] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 28, pp. 2880–2894, 2020.
  • [21] A. Miech, I. Laptev, and J. Sivic, “Learnable pooling with context gating for video classification,” arXiv preprint arXiv:1706.06905, 2017.
  • [22] X. Mei, X. Liu, J. Sun, M. D. Plumbley, and W. Wang, “On metric learning for audio-text cross-modal retrieval,” in Proc. Interspeech, 2022.
  • [23] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [24] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 21271–21284, 2020.
  • [25] C.-H. Yeh, C.-Y. Hong, Y.-C. Hsu, T.-L. Liu, Y. Chen, and Y. LeCun, “Decoupled contrastive learning,” in Proc. ECCV, 2022, pp. 668–684.
  • [26] Z. Huang, H. Chen, Z. Wen, C. Zhang, H. Li, B. Wang, and C. Chen, “Model-aware contrastive learning: Towards escaping the dilemmas,” in Proc. ICML, 2023.
  • [27] R. Huang, Z. Zhao, H. Liu, J. Liu, C. Cui, and Y. Ren, “Prodiff: Progressive fast diffusion model for high-quality text-to-speech,” in Proc. ACM MM, 2022, pp. 2595–2605.
  • [28] Y. Deng, H. Tang, X. Zhang, J. Wang, N. Cheng, and J. Xiao, “Pmvc: Data augmentation-based prosody modeling for expressive voice conversion,” in Proc. ACM MM, 2023.
  • [29] H. Tang, X. Zhang, J. Wang, N. Cheng, and J. Xiao, “Qi-tts: Questioning intonation control for emotional speech synthesis,” in Proc. ICASSP, 2023, pp. 1–5.
  • [30] X. Zhang, J. Wang, N. Cheng, and J. Xiao, “Voice conversion with denoising diffusion probabilistic gan models,” in Proc. ADMA, 2023.
  • [31] D. Yang, D. Wu, W. Zhang, H. Zhang, B. Li, and W. Wang, “Deep semantic-alignment hashing for unsupervised cross-modal retrieval,” in Proc. ICMR, 2020, pp. 44–52.
  • [32] C. D. Kim, B. Kim, H. Lee, and G. Kim, “AudioCaps: Generating captions for audios in the wild,” in Proc. NAACL, 2019, pp. 119–132.
  • [33] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An audio captioning dataset,” in Proc. ICASSP, 2020, pp. 736–740.
  • [34] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in Proc. ICASSP, 2017, pp. 776–780.
  • [35] A. S. Koepke, A.-M. Oncescu, J. Henriques, Z. Akata, and S. Albanie, “Audio retrieval with natural language queries: A benchmark study,” IEEE Trans. Multim., vol. 25, pp. 2675–2685, 2023.