
Zero-Shot Interactive Text-to-Image Retrieval via Diffusion-Augmented Representations

Zijun Long, Hunan University, 2 Lushan S Rd, Changsha, Hunan, China, [email protected]; Kangheng Liang, University of Glasgow, University Avenue, Glasgow, United Kingdom, G12 8QQ, [email protected]; Gerardo Aragon-Camarasa, University of Glasgow, University Avenue, Glasgow, United Kingdom, G12 8QQ, [email protected]; Richard McCreadie, University of Glasgow, University Avenue, Glasgow, United Kingdom, G12 8QQ, [email protected]; and Paul Henderson, University of Glasgow, University Avenue, Glasgow, United Kingdom, G12 8QQ, [email protected]
(2025)
Abstract.

Interactive Text-to-Image Retrieval (I-TIR) has emerged as a transformative way for users to interact with applications in domains such as e-commerce and education. Yet, current methodologies predominantly depend on finetuned Multimodal Large Language Models (MLLMs), which face two critical limitations: (1) finetuning imposes prohibitive computational overhead and long-term maintenance costs, and (2) finetuning narrows the pretrained knowledge distribution of MLLMs, reducing their adaptability to novel scenarios. These issues are exacerbated by the inherently dynamic nature of real-world I-TIR systems, where queries and image databases evolve in complexity and diversity, often deviating from static training distributions. To overcome these constraints, we propose Diffusion Augmented Retrieval (DAR), a paradigm-shifting framework that bypasses MLLM finetuning entirely. DAR combines Large Language Model (LLM)-guided query refinement with Diffusion Model (DM)-based visual synthesis to create contextually enriched intermediate representations. This dual-modality approach deciphers nuanced user intent more holistically, enabling precise alignment between textual queries and visually relevant images. Rigorous evaluations across four benchmarks reveal DAR's dual strengths: (1) it matches state-of-the-art finetuned I-TIR models on straightforward queries without task-specific training, and (2) it surpasses finetuned baselines by 7.61% in Hits@10 (top-10 accuracy) under multi-turn conversational complexity, demonstrating robustness to intricate, distributionally shifted interactions. By eliminating finetuning dependencies and leveraging generative-augmented representations, DAR establishes a new trajectory for efficient, adaptive, and scalable cross-modal retrieval systems.

Interactive Text-to-image Retrieval, Conversational IR, Diffusion Augmented Retrieval.

1. Introduction

Figure 1. Diffusion Conceptual Framework.

Interactive Text-to-Image Retrieval (I-TIR) seeks to identify relevant images for a user through a turn-by-turn dialogue with a conversational agent. This approach allows users to progressively refine their queries with the agent’s guidance, facilitating the retrieval of highly specific results, even when initial queries are vague or simplistic (Levy et al., 2023), as shown in the top-left of Figure 1. This conversational paradigm is increasingly popular, as it is well-suited to content exploration use cases, such as fashion shopping or searching for long-tail content, where the agent can help the user refine their search (Levy et al., 2023; Wu et al., 2021; Saito et al., 2023; Wan et al., 2024; Yuan and Lam, 2021).

Current state-of-the-art solutions to I-TIR typically rely on finetuning Multimodal Large Language Models (MLLMs) for the retrieval task, aiming to bridge the domain gap between MLLM pretraining and retrieval objectives (Lee et al., 2024; Levy et al., 2023). However, we argue that this finetuning is not always needed. First, the one-to-one mapping between text and images that MLLM finetuning aims to establish is neither sufficient nor feasible for the complex and diverse queries in I-TIR. Second, by restricting the pretrained distribution, finetuning compromises the broader generalizability that MLLMs acquire from pretraining, causing such models to underperform for I-TIR whenever they encounter dialogues outside the finetuning distribution. This motivates our key research question: How can we enhance the generalizability of I-TIR frameworks without additional training?

To address this research question, we begin by revisiting the motivations for finetuning MLLMs and examining its associated drawbacks. MLLM pretraining typically uses large-scale, noisy text-image pairs from the internet or crowdsourced data to learn a joint embedding space. However, this approach often leads to sparse alignment, where textual and visual representations of the same content may have differing embeddings. In contrast, retrieval tasks require a more exact, one-to-one mapping to maximize text-image similarity scores. Although finetuning on retrieval datasets partially addresses this gap, we argue that achieving such a high degree of alignment is inherently challenging in I-TIR: textual queries often lack the fine-grained detail needed to describe target images. These limitations make it nearly impossible to establish consistent one-to-one mappings between diverse dialogue queries and target images. Indeed, the finetuned mappings remain confined to the training distribution and therefore lack generalizability.

This lack of generalization in finetuned MLLMs poses a particular obstacle to developing robust I-TIR solutions. Multi-turn interactions frequently generate lengthy, varied, and complex dialogues (Zhai et al., 2024), which are difficult for MLLMs to handle effectively when only pretraining or finetuning on limited I-TIR datasets is available. Indeed, I-TIR is more semantically difficult than text-only conversational search because the cross-modal information space is larger and training data are scarcer (Levy et al., 2023; Murahari et al., 2020). For instance, with 10 dialogue turns and up to 100 potential questions per turn, the conversation space can reach 100 quintillion distinct trajectories, meaning that practically sized training and validation datasets can only cover this scenario space very sparsely. Consequently, once deployed, finetuned I-TIR approaches are likely to encounter queries and images that are poorly represented in any training set, leading to rapid performance degradation (Tejero-de Pablos, 2024). Hence, we argue that I-TIR approaches should explore how to better leverage zero-shot MLLMs, rather than rely on finetuning them.

In terms of efficiency, finetuning MLLMs at scale requires substantial computational resources and GPU memory, making frequent updates impractical for smaller organizations and research groups (Xu et al., 2024). For instance, finetuning MLLMs can demand over 300GB of GPU memory, exceeding the capacity of many standard GPU clusters and effectively rendering additional training infeasible. Consequently, there is a pressing need to explore more efficient strategies to harness the capabilities of advanced MLLMs for I-TIR.

Recent advances in diffusion-based generative models present a promising pathway for adapting pretrained MLLMs to retrieval tasks without finetuning. We argue that diffusion models (DMs) provide valuable prior knowledge of the text-to-image mapping, knowledge that pretrained MLLMs do not fully capture and that finetuning seeks to acquire. By leveraging DM-generated images, we address the domain gap between pretraining and retrieval objectives without requiring additional training. Furthermore, generating multiple images as intermediate representations alleviates the challenge of establishing one-to-one text-image mappings, while preserving the cross-modal knowledge of MLLMs to enhance generality.

Building on this insight, we propose a new I-TIR framework, referred to as Diffusion Augmented Retrieval (DAR), illustrated in Figure 1. The core idea underpinning DAR is to produce multiple intermediate representations of the user’s information need, via LLM-based refinement of dialogue (Wu et al., 2023; Touvron et al., 2023; Kumar et al., 2024) and diffusion-based image generation (Saharia et al., 2022; Rombach et al., 2022). These generative components collectively imagine the user’s intent based on the conversation, and the images in the target corpus are then ranked according to their similarity to these imagined representations. This multi-faceted portrayal of the query provides a richer, more robust foundation than finetuned MLLMs, leading to more accurate identification of semantically and visually related images. Moreover, DAR is compatible with various LLMs and MLLMs and requires no finetuning (see Section 5.2), making it well-suited for integration with larger, more powerful MLLMs in the future.

Figure 2. The overall architecture of our proposed framework DAR.

In summary, our proposed DAR framework offers the following key contributions:

  (1) Diffusion Augmented Retrieval (DAR) Framework: We introduce a novel I-TIR framework, DAR, which transfers prior text-to-image knowledge from DMs through image generation. This approach bridges the gap between MLLMs' pretraining tasks and retrieval objectives without requiring finetuning.

  (2) Multi-Faceted Cross-Modal Representations: DAR generates multiple intermediate representations of the user's information need using LLMs and DMs across interactive turns. By enriching and diversifying how queries are represented, DAR more effectively identifies semantically and visually relevant images.

  (3) Preserving MLLMs' Generalizability in I-TIR: Our approach maintains the broad cross-modal knowledge captured by MLLMs, enabling strong zero-shot performance. By avoiding finetuning, DAR retains the adaptability and versatility of large pretrained models.

  (4) Comprehensive Empirical Validation: We rigorously evaluate DAR on four diverse I-TIR benchmarks, mainly in comparison to prior approaches that rely on finetuned MLLMs. For initial, simpler queries (the first conversation turn), DAR achieves performance comparable to state-of-the-art finetuned models. However, as query complexity grows over multiple turns, DAR consistently surpasses finetuned and un-finetuned approaches by up to 4.22% and 7.61% in Hits@10, respectively.

2. Related work

2.1. Visual Generative Models

Before Diffusion Models (DMs), the most successful visual generative models were Generative Adversarial Networks (GANs) (Goodfellow et al., 2020, 2014; Creswell et al., 2018), which introduced adversarial training for image synthesis (Goodfellow et al., 2020). While GANs achieved notable success in tasks like image generation and style transfer (Goodfellow et al., 2020), they suffer from limitations such as mode collapse, training instability, and the need for carefully tuned architectures, making them less robust for diverse tasks (Gui et al., 2020).

DMs address these challenges by using a noise-adding and noise-removal process, enabling stable and high-quality generation (Croitoru et al., 2023). Initial approaches like Denoising Diffusion Probabilistic Models laid the groundwork (Ho et al., 2020), followed by advancements such as Stable Diffusion (Rombach et al., 2022) and Imagen (Saharia et al., 2022), which enhance efficiency and scalability (Croitoru et al., 2023; Saharia et al., 2022). Diffusion models excel in generating high-fidelity, diverse outputs without mode collapse and are adaptable to multimodal tasks like text-to-image generation (Ramesh et al., 2021). In this work, we aim to leverage the prior knowledge of DMs to bridge the domain gap between the pretraining tasks of Multimodal Large Language Models (MLLMs) and retrieval objectives, achieving superior zero-shot performance and better handling of unseen samples.

2.2. Interactive Text-to-Image Retrieval

Interactive text-to-image retrieval (I-TIR) overcomes the limitations of conventional single-turn methods by iteratively clarifying a user’s information need across multiple dialog turns. This iterative process is especially effective for image retrieval, as a single initial query often fails to capture the fine details of target images. By incorporating user feedback over successive turns, I-TIR progressively aligns with user preferences, improving both retrieval accuracy and user satisfaction. However, this flexibility also adds complexity; the representation of a user’s information need becomes more elaborate due to the diversity of multi-turn dialogs.

Recent studies show the potential of LLMs and MLLMs for I-TIR. For instance, ChatIR (Levy et al., 2023) employs LLMs to simulate dialogues between users and answer bots, compensating for the scarcity of specialized text-to-image datasets tailored to I-TIR. Although this approach opens a promising direction for research, it does not address the unique training hurdles posed by I-TIR, specifically handling highly diverse dialogue inputs. PlugIR (Lee et al., 2024) further advances I-TIR by improving the diversity of top-k results using k-means clustering to group similar candidate images and identify representative exemplars. However, this technique incurs additional computational overhead. Moreover, ChatIR and PlugIR, as well as other I-TIR strategies, rely on fine-tuning on small, curated datasets to achieve good results on a specific benchmark (Zhu et al., 2024; Li et al., 2024; Yoon et al., 2024), limiting their ability to generalize to the broad range of real-world dialogues. Consequently, they often fail on out-of-distribution queries, leading to diminished performance in practical settings, as shown in Section 4.3.

To address these challenges, we propose the DAR framework, which prioritizes zero-shot I-TIR performance. By avoiding dataset-specific fine-tuning altogether, DAR sidesteps the narrowed distribution imposed by small finetuning datasets, thereby achieving superior generalizability.

3. DAR for Interactive Text-to-Image Retrieval

In this section, we present the Diffusion Augmented Retrieval Framework (DAR), illustrated in Figure 2, which comprises three main steps: dialogue reformulation, imagination (image generation), and matching.

To bridge the domain gap between the pretraining tasks of multimodal large language models (MLLMs) and Interactive Text-to-Image Retrieval (I-TIR) without additional training, DAR employs two generative models to imagine user intentions and generate intermediate representations for retrieval:

  • Large Language Model (LLM): An LLM is utilized to adapt the dialogue context, ensuring it closely aligns with the retrieval model's input requirements and the user's intent. This adaptation enhances the relevance of retrieval results by reducing ambiguities in dialogues.

  • Diffusion Model (DM): DMs provide valuable prior knowledge about text-to-image mappings, information that un-finetuned MLLMs lack. By generating multiple images based on LLM-refined prompts, DMs create multifaceted representations of the user's intent, thereby bridging the domain gap and eliminating the need for fine-tuning MLLMs.

The integration of these intermediate representations offers a richer and more robust foundation for bridging the text-to-image domain gap than finetuned MLLMs, leading to more accurate identification of semantically and visually related images.

In the following sections, we first provide background on the setting of I-TIR in Section 3.1. Next, we discuss the first step of DAR, dialogue reformulation via two separate pipelines, in Section 3.2, illustrating how these pipelines refine dialogues for retrieval and improve DM generation. We then introduce our main contributions in Section 3.3, focusing on the core concept of diffusion-augmented multi-faceted generation. Finally, we describe the detailed retrieval procedure in Section 3.4.

3.1. Preliminary

Interactive text-to-image retrieval is formulated as a multi-turn task that begins with an initial user-provided description $D_0$. The objective is to identify a target image through an iterative dialogue between the user and the retrieval framework. At each turn $t$, the retrieval framework generates a question $Q_t$ to clarify the search, and the user responds with an answer $A_t$. This interaction updates the dialogue context $C_t = (D_0, Q_1, A_1, \ldots, Q_t, A_t)$, which is processed (e.g., by concatenating all textual elements) to form a unified search query $S_t$ for that round. The retrieval framework then matches images $I \in \mathcal{I}$ in the image database against $S_t$, ranking them based on a similarity score $s(I, S_t)$. This process iterates until the target image $I^*$ is successfully retrieved or the maximum number of turns is reached. Formally, this process can be defined as:

$$I^* = \arg\max_{I \in \mathcal{I}} s(I, S_T)$$

where $S_T$ is the final search query after $T$ dialogue turns.
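To make this formulation concrete, the following is a minimal Python sketch of the basic I-TIR loop. The `encode_text`, `encode_image`, and `similarity` callables are hypothetical placeholders standing in for an MLLM encoder and a scoring function, not the implementation released with the paper.

from dataclasses import dataclass, field


@dataclass
class DialogueContext:
    """Holds the initial description D_0 and the (Q_t, A_t) pairs accumulated over turns."""
    description: str                             # D_0
    turns: list[tuple[str, str]] = field(default_factory=list)

    def add_turn(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_query(self) -> str:
        """Naive S_t: concatenate all textual elements of C_t."""
        parts = [self.description]
        for q, a in self.turns:
            parts.append(f"Q: {q} A: {a}")
        return " ".join(parts)


def retrieve(context: DialogueContext, images, encode_text, encode_image, similarity):
    """Return the index of argmax_I s(I, S_t) over the candidate pool."""
    query_emb = encode_text(context.as_query())
    scores = [similarity(encode_image(img), query_emb) for img in images]
    return max(range(len(images)), key=lambda j: scores[j])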

3.2. Dialog Context Aware Reformulation

While multi-turn interactions help capture user intent, raw dialogue data can introduce noise and complexity that degrades retrieval performance. Both encoders and diffusion models struggle with lengthy or ambiguous dialogue context, particularly because DMs have limited capacity for long or complex descriptions. Nevertheless, the quality of images generated by DMs is crucial for DAR to achieve superior performance. Moreover, discrepancies between the training distributions of encoders and DMs make it difficult to generate images that accurately reflect user intentions. To address these challenges, we propose two targeted approaches: refining the dialogue for textual representations used in retrieval; and optimizing the prompts for the DM generation process:

  (1) Dialogue Context Aware Reformulation: This pipeline adapts the dialogue context to better align with the input expectations of encoders and the user's intent. Instead of directly using the raw dialogue context $C_t = \{D_0, Q_1, A_1, \ldots, Q_t, A_t\}$ as the textual representation, we follow a multi-step process to refine the input.

    (a) Summarizing the Dialogue: We first ask an LLM to summarize the entire dialogue context $C_t$, providing a coherent and concise overview of the conversation up to turn $t$.

    (b) Structuring the Input: The summarized dialogue is then reformulated into a specific format that clearly distinguishes the initial query ($D_0$) from the subsequent elaborations in the dialogue. This ensures that the LLM understands the progression of the conversation and the relevance of each turn.

    (c) Generating the Refined Query: Using this structured input, we prompt the LLM to generate a refined query $S_t$ that adheres to the encoders' input distribution. This ensures the generated query captures the user's intent in a way that facilitates accurate retrieval.

    The reformulation process can be expressed as:

    $$S_t = \mathcal{R}_1(C_t)$$

    where $\mathcal{R}_1$ denotes the reformulation function utilizing LLMs. An example prompt for $\mathcal{R}_1$ is: "The reconstructed [New Query] should be concise and in an appropriate format to retrieve a target image from a pool of candidate images." This approach allows the dialogue context to be transformed into a more structured and relevant form for the retrieval pipeline, optimizing the alignment between user intent and the model's output.

  (2) Diffusion Prompt Reformulation: This pipeline generates multiple prompts $P_{t,k}$ for use by the subsequent diffusion models based on the reformulated dialogue $S_t$. By producing diverse prompts, we ensure that the generated images align with the diffusion model's training distribution, capturing various linguistic patterns and semantic nuances. The reformulation process follows these steps:

    (a) Structuring the Prompt Template: We begin by structuring a prompt template that captures key elements from the reformulated dialogue $S_t$, including the primary subject, setting, and important details. This structured format helps guide the diffusion model to generate images that are semantically coherent with the user's intent.

    (b) Generating Diverse Prompts: Using the structured template, we generate multiple distinct prompts $P_{t,k}$ by varying linguistic patterns, modifiers, and details, ensuring a variety of interpretations that reflect the full scope of the dialogue. This diversity helps cover different possible details in the image generation process.

    (c) Adapting to the DM's Distribution: The generated prompts are further adjusted to align with the diffusion model's training distribution. This adaptation ensures that the prompts match the model's expectations and improves the relevance of the generated images.

    The reformulation process is expressed as:

    $$P_{t,k} = \mathcal{R}_2(S_t, k) \quad \text{for } k = 1, 2, \ldots, K$$

    where $\mathcal{R}_2$ denotes the prompt generation function and $K$ is the number of prompts generated per turn. An example template prompt for $\mathcal{R}_2$ is: "[Adjective] [Primary Subject] in [Setting], [Key Details]. Style: photorealistic."

    By generating diverse prompts, this approach ensures that the diffusion models produce images that act as multiple intermediate representations of the user's information need, which are both semantically rich and better aligned with the user's query.

Overall, since we focus on zero-shot scenarios where retrieval models are not finetuned on a particular domain or dataset, reformulating the dialog can supply additional context or details. This helps bridge the semantic gap between the user’s language and the system’s learned representation space, boosting performance without further training. Code for these reformulation pipelines is available at https://anonymous.4open.science/r/Diffusion-Augmented-Retrieval-7EF1/README.md.
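As an illustration, the sketch below shows one way the two reformulation functions $\mathcal{R}_1$ and $\mathcal{R}_2$ could be realized with a generic instruction-following LLM. The `llm` callable and the exact prompt wording (beyond the example prompt and template quoted above) are illustrative assumptions rather than the prompts used in our released code.

from typing import Callable

Llm = Callable[[str], str]  # assumed: takes a prompt string, returns the model's reply


def reformulate_dialogue(llm: Llm, dialogue_context: list[str]) -> str:
    """R1: turn the raw dialogue context C_t into a concise, caption-style query S_t."""
    joined = " ".join(dialogue_context)
    prompt = (
        "Summarize the dialogue below, keeping the initial description and all later "
        "clarifications. The reconstructed [New Query] should be concise and in an "
        "appropriate format to retrieve a target image from a pool of candidate images.\n"
        f"Dialogue: {joined}\n[New Query]:"
    )
    return llm(prompt).strip()


def diffusion_prompts(llm: Llm, refined_query: str, k: int = 3) -> list[str]:
    """R2: derive K diverse, diffusion-friendly prompts from the refined query S_t."""
    template = "[Adjective] [Primary Subject] in [Setting], [Key Details]. Style: photorealistic."
    prompts: list[str] = []
    for _ in range(k):
        request = (
            f"Rewrite the query '{refined_query}' as a short text-to-image prompt "
            f"following the template '{template}'. "
            f"Avoid repeating these earlier variants: {prompts}."
        )
        prompts.append(llm(request).strip())
    return prompts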

Algorithm 1 Overall Retrieval Process of DAR
Input: initial user description $D_0$; image pool $\mathcal{I} = \{I_1, I_2, \ldots, I_N\}$; maximum number of turns $T$
Output: retrieved target image $I^*$
1: Initialize dialogue context $C_0 = \{D_0\}$
2: for turn $t = 1$ to $T$ do
3:     System Inquiry: generate question $Q_t$ based on $C_{t-1}$
4:     User Response: receive answer $A_t$ from the user
5:     Update dialogue context: $C_t = C_{t-1} \cup \{Q_t, A_t\}$
6:     Dialogue Context Aware Reformulation: $S_t \leftarrow \mathcal{R}_1(C_t)$ ▷ transform $C_t$ into a caption-style description
7:     Diffusion Prompt Reformulation:
8:     for each prompt index $k = 1$ to $K$ do
9:         Generate prompt $P_{t,k} \leftarrow \mathcal{R}_2(S_t, k)$
10:        Generate synthetic image $\hat{I}_{t,k} \leftarrow \text{DiffusionModel}(P_{t,k})$
11:    end for
12:    Feature Fusion: $F_t \leftarrow \alpha \cdot E(S_t) + \beta \cdot \bigl(\sum_{k=1}^{K} E(\hat{I}_{t,k})\bigr)$
13:    Similarity Computation and Ranking:
14:    for each image $I \in \mathcal{I}$ do
15:        Compute similarity score $s(I, F_t) \leftarrow \text{Similarity}(E(I), F_t)$
16:    end for
17:    Rank images in $\mathcal{I}$ by $s(I, F_t)$ in descending order
18:    Retrieve Top Image: $I^*_t \leftarrow \arg\max_{I \in \mathcal{I}} s(I, F_t)$
19:    if $I^*_t$ is satisfactory then
20:        Terminate Retrieval: set $I^* = I^*_t$ and exit the loop
21:    end if
22: end for
23: Final Retrieval: $I^* = \arg\max_{I \in \mathcal{I}} s(I, F_T)$

3.3. Diffusion Augmented Multi-Faceted Generation

Following the reformulation step, we obtain a refined textual dialog for retrieval (Section 3.2) and multiple prompts that capture different aspects of the user’s intent. We now proceed to the imagine step, where these prompts are used to generate images that augment the retrieval process. We term this approach Diffusion Augmented Retrieval (DAR).

The core idea of DAR involves utilizing DMs to generate synthetic images that serve as multiple intermediate representations of the user’s information needs, thereby enhancing the retrieval process. DMs excel at producing high-quality, visually realistic images from textual descriptions, enabling them to closely align with the user’s intent. By leveraging the prior visual knowledge embedded in DMs through image generation, DAR establishes many-to-one mappings between queries and target images instead of one-to-one mappings. This multi-faceted representation of queries addresses the challenges posed by incomplete or ambiguous textual inputs, offering a richer and more diverse set of visual representations for retrieval. Consequently, DAR achieves robust zero-shot performance, particularly in I-TIR scenarios where labeled data is scarce.

Specifically, for text-guided diffusion generation, the reverse process is conditioned on a text embedding $\mathbf{t}_d$ from the diffusion text encoder:

$$p_\theta(x_{t-1} \mid x_t, \mathbf{t}_d) = \mathcal{N}\bigl(x_{t-1};\, \bm{\mu}_\theta(x_t, t, \mathbf{t}_d),\, \Sigma(t)\bigr),$$

where $\bm{\mu}_\theta(x_t, t, \mathbf{t}_d)$ is the learned denoiser, which predicts the mean of the posterior distribution for $x_{t-1}$ given the noisy input $x_t$, timestep $t$, and conditioning text embedding $\mathbf{t}_d$.

Let $\{P_{t,k}\}_{k=1}^{K}$ be the set of $K$ diffusion-ready prompts produced by the reformulation process. We feed each prompt $P_{t,k}$ into the diffusion model $D(\cdot)$ to generate a corresponding set of images:

$$\{\hat{I}_{t,k}\}_{k=1}^{K} = \{D(P_{t,k})\}_{k=1}^{K}.$$

Each generated image $\hat{I}_{t,k}$ reflects one possible interpretation of the user's intent based on the prompt $P_{t,k}$. By leveraging the large-scale pretraining of the diffusion model and generating multiple images per turn, DAR captures diverse visual features relevant to the query.
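As an illustration of this generation step, the sketch below produces one candidate image per diffusion-ready prompt with the diffusers library. The checkpoint name, inference settings, and the use of StableDiffusion3Pipeline reflect common usage and are assumptions rather than the exact configuration of our experiments.

import torch
from diffusers import StableDiffusion3Pipeline

# Assumed SD3 checkpoint; any compatible text-to-image checkpoint would work here.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")


def imagine(prompts: list[str]) -> list:
    """Generate one synthetic image per diffusion-ready prompt P_{t,k}."""
    images = []
    for prompt in prompts:
        out = pipe(prompt, num_inference_steps=28, guidance_scale=7.0)
        images.append(out.images[0])  # PIL image serving as an intermediate representation
    return images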

3.4. Retrieval Process

3.4.1. DAR Encoding and Feature Fusion

In the DAR framework, an MLLM is employed to encode both the reformulated textual dialogue $S_t$ and the generated images $\{\hat{I}_{t,k}\}_{k=1}^{K}$. Although MLLMs may be based on unified or two-tower architectures for handling text and images, their exact design does not affect the overall flow of DAR, as we only utilize them as encoders. Consequently, we do not delve into internal implementation details. Instead, we focus on four key steps: dialogue encoding, generated image encoding, candidate image encoding, and feature fusion.

  (1) Dialogue Encoding. Let $E_t(\cdot)$ denote the text encoder of the MLLM for dialogue inputs. Given the reformulated textual dialogue $S_t$ at turn $t$, we obtain its embedding as:

    (1) $\mathbf{t}_t = E_t(S_t),$

    where $\mathbf{t}_t \in \mathbb{R}^d$ is the resulting textual embedding and $d$ is the embedding dimensionality.

  (2) Generated Image Encoding. Let $E_v(\cdot)$ denote the image encoder of the MLLM. For each generated image $\hat{I}_{t,k}$, where $k = 1, 2, \ldots, K$, its embedding is:

    (2) $\mathbf{i}_{t,k} = E_v(\hat{I}_{t,k}),$

    so that $\{\mathbf{i}_{t,k}\}_{k=1}^{K}$ is the set of embeddings corresponding to the $K$ synthetic images generated at turn $t$.

  (3) Candidate Image Encoding. Let $\mathcal{I} = \{I_1, I_2, \ldots, I_N\}$ be the set of $N$ candidate images in the database. We encode each candidate image $I_j$ using the same image encoder $E_v(\cdot)$ used for the generated images:

    (3) $\mathbf{i}_j = E_v(I_j), \quad j = 1, 2, \ldots, N.$

    This ensures all images, whether generated or from the database, reside in the same embedding space.

  (4) Feature Fusion. To create a multi-faceted feature representation $F_t$ at turn $t$, we integrate the textual embedding $\mathbf{t}_t$ with the aggregated embeddings of the generated images $\{\mathbf{i}_{t,k}\}$. We introduce weighting factors $\alpha$ and $\beta$ to balance the contributions of the textual and visual embeddings, respectively. Formally:

    (4) $F_t = \alpha\,\mathbf{t}_t + \beta\,\Bigl(\sum_{k=1}^{K} \mathbf{i}_{t,k}\Bigr), \quad \alpha + \beta = 1.$

    Here, $F_t \in \mathbb{R}^d$ is the fused feature vector that captures both multi-faceted linguistic and visual semantics. By adjusting $\alpha$ and $\beta$, one can control the relative influence of textual and visual information in the final representation.

The above steps ensure that DAR leverages the complementary strengths of MLLMs and generated content, thereby enhancing retrieval accuracy in zero-shot settings without requiring further model training.
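A minimal sketch of the fusion step (Eq. 4) is shown below, assuming the text and image embeddings have already been produced by hypothetical encoder wrappers and are available as NumPy vectors; the default weights mirror the 0.7/0.3 setting used in our experiments for early turns.

import numpy as np


def fuse_features(
    text_emb: np.ndarray,                # t_t = E_t(S_t), shape (d,)
    gen_image_embs: list[np.ndarray],    # {i_{t,k}} = E_v(I_hat_{t,k}), each shape (d,)
    alpha: float = 0.7,
    beta: float = 0.3,
) -> np.ndarray:
    """Compute F_t = alpha * t_t + beta * sum_k i_{t,k}, with alpha + beta = 1."""
    assert abs(alpha + beta - 1.0) < 1e-6
    visual_sum = np.sum(np.stack(gen_image_embs, axis=0), axis=0)
    fused = alpha * text_emb + beta * visual_sum
    return fused / np.linalg.norm(fused)  # normalize so cosine similarity becomes a dot product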

3.4.2. Matching.

The matching process in DAR leverages the multi-faceted feature representation $F_t$ to identify the most relevant images from the image pool $\mathcal{I} = \{I_1, I_2, \ldots, I_N\}$. This process involves the following steps:

  (1) Similarity Computation: For each candidate image $I_j \in \mathcal{I}$, we compute the cosine similarity score $s(I_j, F_t)$ between its embedding $\mathbf{i}_j$ and the fused feature $F_t$ as follows:

    $$s(I_j, F_t) = \frac{\mathbf{i}_j \cdot F_t}{\lVert \mathbf{i}_j \rVert\, \lVert F_t \rVert}.$$

  (2) Ranking: We then rank the images in descending order of their similarity scores and retrieve the top-$k$ results:

    (5) $[I^*_1, I^*_2, \ldots, I^*_k] = \mathrm{top}^k_{I \in \mathcal{I}}\, s(I, F_t),$

    where $\mathrm{top}^k(\cdot)$ returns the $k$ highest-scoring images. By comparing the fused embedding $F_t$ with each candidate image embedding $\mathbf{i}_j$, this step identifies the images that best match both the refined query and the diffusion-generated features.

  (3) Iteration: Finally, the retrieval process iterates with new dialogue turns, updating $F_t$ at each turn $t$, until the maximum number of turns $T$ is reached or the target image $I^*$ is successfully retrieved. At the final turn $T$, the retrieval result is given by:

    (6) $I^* = \arg\max_{I \in \mathcal{I}}\, s(I, F_T).$

The QA turn generation follows the methods outlined in (Levy et al., 2023), with further details provided in Section 4.1. The complete retrieval process of DAR is given in Algorithm 1.
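The similarity and ranking steps reduce to a standard dense-retrieval scan. The sketch below assumes the candidate embeddings are precomputed as rows of a NumPy matrix and reuses the fused vector produced by the fusion sketch above.

import numpy as np


def rank_candidates(fused: np.ndarray, candidate_embs: np.ndarray, top_k: int = 10):
    """Return indices and cosine scores of the top-k candidates for the fused query F_t."""
    # Cosine similarity: normalize candidates, then a single matrix-vector product.
    norms = np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = (candidate_embs / norms) @ (fused / np.linalg.norm(fused))
    order = np.argsort(-scores)[:top_k]  # descending similarity
    return order, scores[order]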

Figure 3. The experimental results for the four evaluated benchmarks. Note that for Hits@10, a higher value is better. It is a cumulative metric because we cease to use additional dialogue turns once the target image attains a top-k rank.

4. Experimental Results

4.1. Experimental Settings

We evaluate our proposed DAR framework in interactive retrieval settings using four well-established benchmarks. Specifically, we employ the validation set of the Visual Dialog (VisDial) (Das et al., 2019) dataset and three dialog datasets constructed by (Levy et al., 2023) (available at https://github.com/levymsn/ChatIR), named ChatGPT_BLIP2, HUMAN_BLIP2, and Flan-Alpaca-XXL_BLIP2. For the latter three dialog datasets, the first part of each name indicates the questioner model used to generate the dataset, and the second part refers to the answerer model. Note that HUMAN refers to professional human assessors. Further details on these three datasets can be found in (Levy et al., 2023). All four dialog datasets consist of 2,064 dialogues, each with 10 dialogue turns.

Following previous work in the interactive cross-modal retrieval domain, we use BLIP (Li et al., 2022) as our default MLLM encoder, given its established zero-shot performance and to ensure a fair comparison with prior studies. Unless otherwise specified, we report Hits@10 as our primary evaluation metric.

We adopt the Stable Diffusion 3 model (SD3) (Esser et al., 2024) as the default diffusion model (DM) in our experiments. Additionally, BLIP-3 (Xue et al., 2024) is employed as the Large Language Model (LLM) for reformulating textual content. Using specially designed prompts, BLIP-3 operates under two distinct reformulation pipelines: one for adapting the dialogue and another for producing aligned prompts for the DM, as described in Section 3.2. We empirically set the weighting factors for textual and visual content to 0.7 and 0.3, respectively, for the first two dialogue turns. Starting from turn 3, both weights are set to 0.5. This strategy balances the influence of textual vs. visual embeddings as the dialogue becomes more dynamic. In all experiments, we fix the number of generated images per turn at three. Code for all experiments is available at https://anonymous.4open.science/r/Diffusion-Augmented-Retrieval-7EF1/README.md.

Baselines and DAR variants. We compare our proposed DAR framework with three baselines, namely ChatIR, ZS, and COCOFT:

  • ChatIR: We adopt the BLIP-based variant of ChatIR (Levy et al., 2023) as our baseline. Since it is finetuned on the Visual Dialog dataset, ChatIR represents a finetuned model that helps us compare both the effectiveness and efficiency benefits of DAR.

  • ZS: The zero-shot (ZS) baseline uses BLIP with its original, publicly available pretrained weights (available at https://github.com/salesforce/BLIP), without any fine-tuning on retrieval datasets. This setup captures the common scenario where researchers or practitioners directly rely on publicly released weights without task-specific training.

  • COCOFT: The COCOFT baseline denotes the BLIP model finetuned on the popular MSCOCO (Lin et al., 2014) retrieval dataset. Although it benefits from MSCOCO-specific training, it remains non-finetuned for interactive text-to-image retrieval (I-TIR). Consequently, dialogues including questions and answers are applied directly as queries for retrieval. This serves as an indicator of how previously finetuned single-turn retrieval models perform in an interactive retrieval setting.

We integrate each of the three baseline encoders into our DAR framework to evaluate improvements across different scenarios, denoting them as DAR_xxx. For example, DAR_chatir adopts the BLIP-based ChatIR model as the encoder. Notably, comparisons should primarily focus on the corresponding pairs—such as DAR_chatir versus ChatIR—whose lines are represented by the same colors in Figure 3. This is because these pairs are initialized from the same pre-trained weights, ensuring a fair comparison that shows how the addition of DAR affects performance.

4.2. Zero-Shot I-TIR Performance

We first investigate how our proposed DAR framework performs under zero-shot I-TIR conditions. Specifically, BLIP_zs denotes the baseline where BLIP is used with its original, publicly available pretrained weights, without any fine-tuning on retrieval datasets. Our DAR_zs setup similarly employs pretrained BLIP as the encoder, thereby reflecting a common scenario in which researchers or practitioners rely solely on publicly released weights.

Each subfigure (a–d) in Figure 3 presents the performance of DAR variants and baseline models across the four benchmarks outlined in Section 4.1, evaluated over multiple conversational turns.

In this section, we compare the dashed blue lines (baseline BLIP_zs) with the solid blue lines (DAR_zs) across the four subfigures (a–d) in Figure 3. DAR_zs consistently outperforms BLIP_zs across all four evaluated benchmarks. Notably, the FLAN_BLIP2 dataset exhibits the largest improvement, with DAR_zs achieving a 7.61% increase in Hits@10 after 10 dialogue rounds. Indeed, the performance gap between DAR_zs and BLIP_zs consistently widens as the dialogue extends over multiple turns.

These findings demonstrate that DAR effectively boosts zero-shot performance in interactive retrieval tasks where our "query" is derived from a complex and diverse dialog, and it can do this without incurring additional tuning overhead.

4.3. Robustness to Complex and Diverse Dialogue Queries

Since our proposed DAR framework is aimed at tackling the challenges of complex and diverse dialogue queries in I-TIR, evaluating its robustness under such conditions is important. Therefore, we focus our analysis on the most challenging benchmark among the four we evaluated, specifically the one where the best-performing model achieves the lowest Hits@10 score. In this section, we analyze the solid green lines (DAR_chatir) as the best-performing model, comparing them with the dashed lines representing the three baselines across the four subfigures (a–d) in Figure 3.

We observe a similar performance trend across three of the benchmarks, whereas FLAN_BLIP2 differs considerably (see Figure 3.d). On the FLAN_BLIP2 benchmark, the three baseline models (dashed lines) without DAR tend to plateau around the fifth dialogue turn, indicating they cannot effectively leverage the additional detail introduced later in the conversation. In contrast, our DAR framework continues to improve after turn 5, showing no signs of saturation. Additionally, even our best-performing model, DAR_chatir, achieves its lowest Hits@10 on the FLAN_BLIP2 benchmark. We attribute this to the highly diverse dialogues in FLAN_BLIP2 compared to the other datasets, making it an important test case for evaluating out-of-distribution performance.

This suggests that DAR can better exploit the extra information presented in later dialogue turns and generate images more closely aligned with the user’s target. Therefore, DAR demonstrates greater robustness to complex and diverse dialogue queries, fulfilling the requirements of I-TIR.

4.4. Gains and losses of DAR compared to finetuned models

In this section, we investigate both the effectiveness and efficiency of the zero-shot, non-finetuned version of our proposed DAR framework, DAR_zs, by comparing it against the finetuned ChatIR baseline. Thus, we compare the solid blue line (DAR_zs) with the dashed green line (ChatIR) across the four subfigures (a–d) in Figure 3. Overall, DAR_zs demonstrates competitive performance on all evaluated benchmarks.

As analyzed in Section 4.3, FLAN_BLIP2 is the most challenging dataset in our experiments, owing to its particularly complex and dynamic interactive dialogues. Notably, DAR_zs (solid blue line) outperforms the finetuned ChatIR model (dashed green line) by 4.22% in Hits@10 at turn 10 (Figure 3.d), supporting our hypothesis that fine-tuning MLLMs on limited I-TIR datasets can undermine their ability to handle out-of-distribution queries. Finetuned models like ChatIR perform best on the VisDial dataset (see Figure 3.a), given their extensive fine-tuning on 123k VisDial training samples. Even in this setting, which most favors the finetuned model, the non-finetuned DAR_zs lags behind ChatIR by only 3.01% in Hits@10, all while saving 100% of the fine-tuning time.

This suggests that ChatIR’s narrower finetuned distribution makes it more prone to out-of-distribution errors during inference, leading to underperformance. In contrast, DAR excels in these failure scenarios and remains competitive even on ChatIR’s own finetuned dataset.

4.5. Compatibility with finetuned Models

Our proposed DAR can also be combined with various encoders, including those already finetuned for dialogue-based text. In this context, the main comparison is between DAR_chatir and ChatIR, where DAR_chatir employs the finetuned ChatIR model as its encoder. Thus, we compare the solid green line (DAR_chatir) with the dashed green line (ChatIR) across the four subfigures (a–d) in Figure 3.

Notably, as shown in Figure 3, DAR_chatir consistently outperforms ChatIR without additional training. Because ChatIR is finetuned on the VisDial dataset, it holds an obvious advantage on that benchmark. Building upon this strong baseline, DAR_chatir still achieves a 2.42% improvement in Hits@10 over ChatIR on the VisDial dataset.

More importantly, as discussed earlier, finetuned models often struggle with out-of-distribution data in interactive retrieval due to the inherently dynamic nature of multi-turn dialogues. This challenge arises in the other three benchmark datasets we evaluate. Consequently, DAR_chatir exhibits even greater performance gains in these scenarios, achieving up to a 9.4% increase in Hits@10 compared to ChatIR.

These findings support our claim that DAR can enhance zero-shot retrieval performance without further training—particularly in the face of dynamic dialogues in interactive retrieval—even when the underlying encoder is finetuned on a specific dataset.

4.6. Does Fine-Tuning on Single-Turn Retrieval Datasets Improve Performance?

Given that single-turn and interactive text-to-image retrieval share some similarities at turn 0, this section investigates whether fine-tuning on a single-turn retrieval dataset—specifically, the widely used MSCOCO dataset (Lin et al., 2014)—enhances performance in interactive retrieval tasks.

As shown in all four subfigures of Figure 3, the benefit of fine-tuning on a single-turn retrieval dataset is relatively limited. While finetuned models such as BLIP_cocoft (dashed red lines) and DAR_cocoft (solid red lines) exhibit noticeable performance improvements at earlier turns compared to their zero-shot counterparts (BLIP_zs and DAR_zs), this advantage diminishes as the dialogue progresses. By turn 8, the zero-shot model DAR_zs even surpasses the COCO-finetuned model DAR_cocoft in performance, as illustrated in Figure 3.d.

These findings show that conventional finetuning strategies on single-turn datasets are insufficient to address the increasing query diversity and complexity inherent in interactive text-to-image retrieval. This underscores the limitations of fine-tuning-based approaches in handling evolving multi-turn interactions. In contrast, our proposed DAR framework effectively bridges this research gap by enhancing zero-shot performance without requiring additional fine-tuning, making it a more scalable and adaptive solution for real-world retrieval scenarios.

Figure 4. Examples of generated images in DAR.

5. Analysis

5.1. Qualitative Analysis of DAR-Generated Images

Beyond the numerical results presented in Section 4, we conduct an in-depth analysis of the images generated by DAR and uncover several key insights. As illustrated in Figure 4, the initial generated images tend to lack fine details and are easily distinguishable as synthetic. However, as the dialogue progresses and additional contextual information is incorporated, the generated images become increasingly photorealistic, with details aligning more closely with the target image, such as the accurate color of the player's clothing. A similar trend is observed in the second-row example, where the diffusion model dynamically adjusts clothing colors based on dialogue updates, ultimately enhancing retrieval accuracy.

These findings highlight DAR’s ability to iteratively refine image generation based on evolving dialogue cues, enabling more precise semantic alignment with the target image. This adaptive generation process not only improves retrieval performance but also demonstrates the potential of integrating generative models into cross-modal retrieval frameworks.

5.2. Compatibility of DAR with Different Encoders and Generators

To assess the compatibility of DAR with different MLLMs as encoders and DMs as visual generators, we evaluate its performance using two widely adopted MLLMs—CLIP (Radford et al., 2021) and BEiT-3 (Wang et al., 2022)—as encoders, as well as Stable Diffusion v2-1 (Rombach et al., 2022) as an alternative visual generator.

The CLIP model, constrained by its maximum input length of 77 tokens, struggles to handle complex and lengthy dialogue-based queries compared to simpler caption-based queries in single-turn cross-modal retrieval tasks. Consequently, CLIP achieves a peak performance of 56.64% Hits@10 on the Visual Dialog benchmark. Despite this limitation, incorporating CLIP as the encoder within DAR still yields an improvement of 5.31% Hits@10 after 10 dialogue turns, demonstrating the robustness of our framework even with models that have constrained input capacity.

BEiT-3, a stronger encoder than CLIP and BLIP models, further enhances performance. When employed as the encoder within DAR, the framework achieves a 6.28% improvement in Hits@10 after 10 turns on the Visual Dialog benchmark, compared to the standalone pre-trained BEiT-3 model, highlighting the adaptability of DAR in leveraging stronger backbone encoders.

In addition to evaluating different encoders, we also assess the impact of using Stable Diffusion v2-1 as the visual generator. While DAR continues to yield notable improvements, the performance gain is slightly lower than with Stable Diffusion 3 (SD3), consistent with SD3's superior generative capabilities. Specifically, DAR achieves a 6.37% improvement in Hits@10 with Stable Diffusion v2-1, compared to 7.61% with SD3.

These results demonstrate that DAR is flexible and compatible with various encoder and generation models. Regardless of the choice of encoder or visual generator, DAR consistently enhances retrieval performance, reinforcing its generalizability as an effective framework for zero-shot cross-modal retrieval.

5.3. Impact of the Number of Generated Images on Performance

In our evaluation, even when generating only a single image within our framework, we observe a 6.43% improvement in Hits@10 on the FLAN_BLIP2 benchmark compared to the ChatIR baseline. When increasing the number of generated images to three, performance further improves to 7.61%. However, beyond this point, we observe diminishing returns, with Hits@10 reaching saturation as the number of generated images continues to increase.

Considering the trade-off between effectiveness and computational efficiency, we set three generated images as the default configuration in DAR, striking a balance between performance and inference cost.

5.4. Generation Overhead

While DAR eliminates the need for expensive fine-tuning, it introduces additional inference-time overhead due to the reformulation and generation of visual content. However, this trade-off is small compared with the benefits in retrieval performance and adaptability. In our experiments using a single Nvidia RTX 4090 GPU, the image generation process takes approximately 5 seconds, while the query reformulation step incurs only 0.5 seconds of additional processing time.

To put this into perspective, ChatGPT-o1 and other ‘reasoning’ LLMs use additional reasoning steps, which enhances response quality at the cost of increased inference time (often exceeding 10 seconds). Analogously, DAR achieves substantial gains in zero-shot retrieval while entirely eliminating training costs, making it an efficient and scalable alternative. Consequently, the modest inference overhead is a worthwhile trade-off, particularly in applications where adaptability and retrieval quality are crucial.

6. Conclusion

In this work, we introduce DAR, a novel framework that eliminates the need for fine-tuning multimodal large language models (MLLMs) for Interactive Text-to-Image Retrieval (I-TIR), thereby preserving their generalizability. Extensive experiments show that DAR performs competitively with existing I-TIR models, whether they are fine-tuned or pretrained. Our analysis reveals that fine-tuning MLLMs on limited retrieval datasets compromises their ability to handle out-of-distribution queries. Furthermore, our results illustrate that generating multiple intermediate, multi-faceted representations of user intent enables a many-to-one mapping between text and images, effectively accommodating the diverse and complex queries characteristic of I-TIR. This work highlights the potential of diffusion-augmented retrieval and suggests avenues for future exploration in optimizing efficiency, supporting multimodal queries, and extending to real-world applications.

References

  • Creswell et al. (2018) Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. 2018. Generative adversarial networks: An overview. IEEE signal processing magazine 35, 1 (2018), 53–65.
  • Croitoru et al. (2023) Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. 2023. Diffusion Models in Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 45, 9 (2023), 10850–10869.
  • Das et al. (2019) Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Stefan Lee, José M. F. Moura, Devi Parikh, and Dhruv Batra. 2019. Visual Dialog. IEEE Trans. Pattern Anal. Mach. Intell. 41, 5 (2019), 1242–1256.
  • Esser et al. (2024) Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. 2024. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In Proceedings of the International Conference on Machine Learning, ICML 2024.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the Advances in neural information processing systems conference, NeurIPS 2014, Vol. 27.
  • Goodfellow et al. (2020) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Commun. ACM 63, 11 (2020), 139–144.
  • Gui et al. (2020) Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, and Jieping Ye. 2020. A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications. CoRR abs/2001.06937 (2020).
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Proceedings of the Advances in neural information processing systems conference, NeurIPS, Vol. 33. 6840–6851.
  • Kumar et al. (2024) Ankit Kumar, Richa Sharma, and Punam Bedi. 2024. Towards Optimal NLP Solutions: Analyzing GPT and LLaMA-2 Models Across Model Scale, Dataset Size, and Task Diversity. Engineering, Technology & Applied Science Research 14, 3 (2024), 14219–14224.
  • Lee et al. (2024) Saehyung Lee, Sangwon Yu, Junsung Park, Jihun Yi, and Sungroh Yoon. 2024. Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL 2024. Association for Computational Linguistics, 791–809.
  • Levy et al. (2023) Matan Levy, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. 2023. Chatting Makes Perfect: Chat-based Image Retrieval. In Proceedings of the Advances in Neural Information Processing Systems, NeurIPS, 2023.
  • Li et al. (2022) Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the International Conference on Machine Learning, ICML 2022, Vol. 162. PMLR, 12888–12900.
  • Li et al. (2024) Yongqi Li, Wenjie Wang, Leigang Qu, Liqiang Nie, Wenjie Li, and Tat-Seng Chua. 2024. Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL 2024. Association for Computational Linguistics, 11851–11861.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Proceedings of The European Conference on Computer Vision ECCV 2014, Vol. 8693. Springer, 740–755.
  • Murahari et al. (2020) Vishvak Murahari, Dhruv Batra, Devi Parikh, and Abhishek Das. 2020. Large-Scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline. In Proceedings of The European Conference on Computer Vision ECCV 2020, Vol. 12363. Springer, 336–352.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning, ICML 2021, Vol. 139. PMLR, 8748–8763.
  • Ramesh et al. (2021) Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proceedings of The International conference on machine learning ICML. PMLR, 8821–8831.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022. IEEE, 10674–10685.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems, NeurIPS 2022.
  • Saito et al. (2023) Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. 2023. Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. IEEE, 19305–19314.
  • Tejero-de Pablos (2024) Antonio Tejero-de Pablos. 2024. Complementary-Contradictory Feature Regularization Against Multimodal Overfitting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2024. 5679–5688.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. CoRR abs/2302.13971 (2023).
  • Wan et al. (2024) Yongquan Wan, Wenhai Wang, Guobing Zou, and Bofeng Zhang. 2024. Cross-modal feature alignment and fusion for composed image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024. 8384–8388.
  • Wang et al. (2022) Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, and Furu Wei. 2022. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks. CoRR abs/2208.10442 (2022).
  • Wu et al. (2021) Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogério Feris. 2021. Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback. In Proceedings of The IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021. Computer Vision Foundation / IEEE, 11307–11317.
  • Wu et al. (2023) Tianyu Wu, Shizhu He, Jingping Liu, Siqi Sun, Kang Liu, Qing-Long Han, and Yang Tang. 2023. A Brief Overview of ChatGPT: The History, Status Quo and Potential Future Development. IEEE/CAA Journal of Automatica Sinica 10, 5 (2023), 1122–1136.
  • Xu et al. (2024) Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, and Xuanzhe Liu. 2024. A Survey of Resource-efficient LLM and Multimodal Foundation Models. CoRR abs/2401.08092 (2024).
  • Xue et al. (2024) Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S. Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, and Ran Xu. 2024. xGen-MM (BLIP-3): A Family of Open Large Multimodal Models. CoRR abs/2408.08872 (2024).
  • Yoon et al. (2024) Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, Kang Zhang, Yu-Jung Heo, Du-Seong Chang, and Chang D. Yoo. 2024. BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation. In Proceedings of the European Conference on Computer Vision, ECCV 2024, Vol. 15089. Springer, 378–396.
  • Yuan and Lam (2021) Yifei Yuan and Wai Lam. 2021. Conversational Fashion Image Retrieval via Multiturn Natural Language Feedback. In Proceedings of The International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021. ACM, 839–848.
  • Zhai et al. (2024) Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. 2024. Investigating the catastrophic forgetting in multimodal large language model fine-tuning. In Proceedings of The Conference on Parsimony and Learning. 202–227.
  • Zhu et al. (2024) Hongyi Zhu, Jia-Hong Huang, Stevan Rudinac, and Evangelos Kanoulas. 2024. Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models. In Proceedings of the 2024 International Conference on Multimedia Retrieval, ICMR 2024. ACM, 978–987.