
CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models
* These authors contributed equally to this work; † corresponding authors.

Yeyuan Wang∗,1, Dehong Gao∗,2, Bin Li1, Rujiao Long3, Lei Yi3,
Xiaoyan Cai†,1, Libin Yang†,2, Jinxia Zhang4,5, Shanqing Yu6,7, Qi Xuan6,7
1School of Automation, Northwestern Polytechnical University, Xi’an, Shaanxi, China
2School of Cybersecurity, Northwestern Polytechnical University, Xi’an, Shaanxi, China
3Alibaba Group, Hangzhou, Zhejiang, China

4The Key Laboratory of Measurement and Control of CSE, Ministry of Education,
School of Automation, Southeast University, Nanjing 210096, China
5Advanced Ocean Institute of Southeast University, Nantong 226010, China
6Zhejiang University of Technology, Hangzhou, Zhejiang, China
7Binjiang Institute of Artificial Intelligence, Hangzhou, Zhejiang, China
{wangyeyuan, libin0225}@mail.nwpu.edu.cn, {dehong.gdh, xiaoyanc, libiny}@nwpu.edu.cn
{rujiao.lrj, yilei.yi}@alibaba-inc.com, jinxiazhang@seu.edu.cn, {yushanqing, qixuan}@zjut.edu.cn
Abstract

The impressive performance of Large Language Model (LLM) has prompted researchers to develop Multi-modal LLM (MLLM), which has shown great potential for various multi-modal tasks. However, current MLLM often struggles to effectively address fine-grained multi-modal challenges. We argue that this limitation is closely linked to the models’ visual grounding capabilities. The restricted spatial awareness and perceptual acuity of visual encoders frequently lead to interference from irrelevant background information in images, causing the models to overlook subtle but crucial details. As a result, achieving fine-grained regional visual comprehension becomes difficult. In this paper, we break down multi-modal understanding into two stages, from Coarse to Fine (CoF). In the first stage, we prompt the MLLM to locate the approximate area of the answer. In the second stage, we further enhance the model’s focus on relevant areas within the image through visual prompt engineering, adjusting attention weights of pertinent regions. This, in turn, improves both visual grounding and overall performance in downstream tasks. Our experiments show that this approach significantly boosts the performance of baseline models, demonstrating notable generalization and effectiveness. Our CoF approach is available online at https://github.com/Gavin001201/CoF.

Index Terms:
vision and language, prompt engineering, fine-grained understanding


Figure 1: Overview of (a) the baseline approach and (b) our CoF approach. CoF consists of two stages: (1) Location Grounding and (2) Attention Reweighting. In the first stage, the MLLM takes the question Q, the grounding prompt P_g, and the image I as input and obtains the coarse-grained coordinates of the answer region. In the second stage, we reweight the attention scores of the answer region according to the coarse-grained coordinates obtained in the first stage, guiding the model to focus on the answer region.

I Introduction

Large Language Model (LLM) has demonstrated remarkable capabilities in natural language processing (NLP) [1, 2]. Researchers have further advanced these models by integrating them with visual perception modules, resulting in Multi-modal Large Language Model (MLLM) [3, 4, 5]. Experimental evidence indicates that MLLM performs effectively across a variety of multi-modal tasks [6, 7, 8]. However, the limited perceptual abilities of current visual perception models hinder their capacity for fine-grained understanding [9, 10, 11, 12].

Prominent open-source MLLM typically utilizes a model pretrained with contrastive learning (e.g., CLIP [13]) as the visual perception component [4]. However, it has been reported that this instance-level contrastive objective [13, 14] does not work well for fine-grained perception [15, 16, 17]. This limitation is inherited by MLLM that integrate these visual perception models, rendering them more susceptible to distractions from irrelevant information [18, 10]. Consequently, they frequently struggle to identify subtle yet critical target information, which degrades their performance on fine-grained downstream tasks and increases the likelihood of generating “hallucinations”.

Considerable research has attempted to address this limitation [19, 20]. These studies focus on enhancing the visual perception ability of MLLM by increasing image resolution or introducing auxiliary information [11, 19, 20]. Moreover, researchers have observed that the CLIP visual encoder tends to prioritize content within specific marked areas [21, 22, 23]. Building on this observation, they have employed visual prompt engineering to guide the model’s attention towards specific areas, optimizing its capture of target information [24]. Subsequent efforts have integrated these findings into MLLM, specifically by creating region-level image-text pair datasets with distinct visual markers and adjusting model parameters to enhance the fine-grained understanding capabilities of MLLM [11, 25, 26]. However, these methods often incur higher computational costs and require substantial human labor. Thus, there is an urgent need for a more efficient approach.

To comprehend complex visual information, humans typically focus on specific regions or details within a given sample. In the situation shown in Fig. 1, when asked to provide a detailed description of a local area or about the attributes of a small target object, humans generally scan the entire image, locate the approximate position of the target, and then concentrate on it. In contrast, most MLLM process image information at a fixed granularity (e.g., Fig. 1(a)) [27, 28, 29], making it challenging to capture subtle but critical information that may be difficult to detect visually. To mimic efficient human-like reasoning, models need to identify the target image region containing essential visual details and pay more attention to it to capture fine-grained critical visual information (e.g., Fig. 1(b)). Current MLLM struggle with this: they are easily affected by background noise or resort to seeking answers in the text prompt, which amplifies the influence of language priors and results in poor performance on fine-grained tasks as well as hallucination problems.

In this paper, we propose a Coarse-to-Fine-grained (CoF) multi-modal understanding approach aimed at improving the fine-grained perception capabilities of MLLM. Different from current Chain-of-Thought (CoT) approaches that directly guide the input and output of the model, our approach operates in the latent variable space. The proposed CoF approach consists of two main stages. The first stage involves identifying the target area(s) within the given image, while the second stage focuses on comprehending its(their) semantic information. We simplify fine-grained multi-modal understanding into two manageable stages by breaking down the image understanding process. The initial identification of the target area allows for greater concentration on this critical region in the subsequent stage, thereby reducing interference from irrelevant information. This refined focus increases the likelihood of achieving an accurate understanding of the target area. In our specific implementation, we first prompt the MLLM to ascertain the coordinates of the target area, subsequently assigning higher attention weights to the visual tokens in this region to precisely direct the model’s focus. This method significantly improves its fine-grained understanding capabilities. Extensive experiments validate the effectiveness of our approach.

The primary contributions of this paper are as follows: (1) We introduce the CoF (Coarse-to-Fine) approach, which streamlines image understanding into two stages: coarse-grained localization and fine-grained perception. (2) We integrate visual prompt engineering technology to bolster the fine-grained understanding capabilities of MLLM in a low-resource manner. (3) We conduct comprehensive experiments that demonstrate CoF’s robust fine-grained understanding capabilities across a wide range of downstream tasks.

II METHOD

As illustrated in Fig. 2, our CoF comprises two primary stages: (1) the coarse-grained location grounding stage and (2) the fine-grained attention reweighting stage. This section delves into a comprehensive description of the approach.


Figure 2: Overview of CoF approach. (a) In the first stage, the model determines the answer area in the input image based on the question. (b) Then the output coordinates are post-processed and converted into a binary mask matrix, and the attention map of the input image is reweighted according to the mask matrix.

II-A Preliminaries

MLLM typically consists of a visual encoder $v_{\phi}(\cdot)$, an LLM decoder $f_{\theta}(\cdot)$, and an image-text alignment module $p_{\epsilon}(\cdot)$ (parameterized by $\phi$, $\theta$, and $\epsilon$, respectively). Given an image I and a textual prompt P, the image I is first encoded into visual embeddings via $v_{\phi}(\cdot)$, which are then mapped into a shared LLM embedding space through $p_{\epsilon}(\cdot)$, resulting in a set of visual tokens $e_{v}$. The LLM decodes the output response R in a sequential manner, which is formulated as:

$R = f_{\theta}\bigl(p_{\epsilon}(v_{\phi}(I)),\, P\bigr)$  (1)

For the decoder, the Transformer-based LLM is centered on the attention module and models the relationship between the visual tokens $e_{v}$ and the text tokens $e_{t}$ through the attention mechanism. Specifically, the model computes the attention map $A$ using the following formula:

$A = \mathrm{softmax}\!\left(\frac{[e_{v}, e_{t}] \cdot [e_{v}, e_{t}]^{T}}{\sqrt{d_{k}}}\right)$,  (2)

where $d_{k}$ is a scaling factor. The attention matrix $A$ comprises elements $A_{ij}$ with $i, j \in \{1, 2, \ldots, n\}$, each representing the relationship between the $i$-th token and the $j$-th token and their impact on the output. The exact visual encoder, LLM decoder, image-text alignment module, and pretraining method for the parameters $\phi$, $\theta$, and $\epsilon$ differ between models, but the overarching formulation described above remains the same.
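To make Eq. (2) concrete, below is a minimal PyTorch sketch of this attention map over the concatenated visual and text tokens. It follows the formula as written (a single head, no learned query/key/value projections); actual MLLM decoders use learned projections and multiple heads, so this is an illustration rather than the implementation of any particular model.

```python
import torch
import torch.nn.functional as F

def attention_map(e_v: torch.Tensor, e_t: torch.Tensor) -> torch.Tensor:
    """Compute the attention map A of Eq. (2) for a single head.

    e_v: visual tokens, shape (n_v, d_k)
    e_t: text tokens,   shape (n_t, d_k)
    Returns A with shape (n, n), where n = n_v + n_t.
    """
    x = torch.cat([e_v, e_t], dim=0)    # concatenated sequence [e_v, e_t]
    d_k = x.size(-1)
    scores = x @ x.T / d_k ** 0.5       # scaled dot-product scores
    return F.softmax(scores, dim=-1)    # row-wise softmax yields A

# Example: 576 visual tokens and 32 text tokens with hidden size 128.
A = attention_map(torch.randn(576, 128), torch.randn(32, 128))
```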

II-B Coarse-grained Location Grounding

In this stage, we leverage the grounding capability of MLLM to locate the answer in the image. As shown in Fig. 2, given the image I and the corresponding question Q, we employ a manually designed grounding prompt $P_{\text{g}}$ (e.g., “According to the information in the image and the question, detail the bounding box of the region in the image that contains the answer in JSON format.”) to guide the MLLM to identify the approximate area of the image related to the question and output its corresponding bounding box coordinates. After acquiring the coordinates of the answer region, we apply a post-processing step that optimizes this output for the subsequent attention reweighting stage, which generates the accurate response to question Q. A sketch of the grounding call is given below, followed by a description of the post-processing procedure.
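In the sketch, the `mllm_generate` callable and the JSON schema `{"bbox": [x1, y1, x2, y2]}` are hypothetical placeholders for illustration, not the exact interface of the released implementation.

```python
import json
import re

GROUNDING_PROMPT = (
    "According to the information in the image and the question, detail the bounding box "
    "of the region in the image that contains the answer in JSON format."
)

def ground_answer_region(mllm_generate, image, question):
    """Stage 1: ask the MLLM for the coarse-grained answer region.

    `mllm_generate(image, prompt) -> str` is a placeholder for the model's
    generation call; the JSON schema {"bbox": [x1, y1, x2, y2]} is assumed.
    """
    response = mllm_generate(image, f"{question}\n{GROUNDING_PROMPT}")
    match = re.search(r"\{.*\}", response, re.DOTALL)  # tolerate extra text around the JSON
    if match is None:
        return None                                    # no box found: keep full-image attention
    return json.loads(match.group(0)).get("bbox")      # [x1, y1, x2, y2] in image coordinates
```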

The initial step is to determine the central coordinates of the bounding box and to adjust the box around this center. This process introduces an expansion hyper-parameter $\alpha$, which dictates the size to which the bounding box should be expanded. If the expanded box exceeds the image boundary, we shift it so that it remains within the image.
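Assuming pixel-coordinate boxes in [x1, y1, x2, y2] form, this expand-then-shift post-processing can be sketched as follows (a sketch of the described behavior, not the exact implementation):

```python
def expand_bbox(bbox, alpha, img_w, img_h):
    """Expand a bounding box about its center by factor `alpha`, then shift it
    back inside the image if it crosses a boundary."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Enlarged size, capped at the image size so shifting always succeeds.
    w = min((x2 - x1) * alpha, img_w)
    h = min((y2 - y1) * alpha, img_h)
    # Re-center the enlarged box, then shift it to stay within the image.
    nx1 = min(max(cx - w / 2.0, 0.0), img_w - w)
    ny1 = min(max(cy - h / 2.0, 0.0), img_h - h)
    return [nx1, ny1, nx1 + w, ny1 + h]
```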

At this stage, we have acquired the coarse-grained position coordinates of the answer area, which contain sufficient information to generate the correct response and will be utilized to guide the reweighting of visual attention in the second stage.

II-C Fine-grained Attention Reweighting

After obtaining the coordinates of the target region in the first stage, we convert them into a binary mask matrix $M$. Specifically, elements within the target region are assigned a value of 1, while all others are assigned 0. This binary mask serves as a condition for decoding the subsequent sequence. In practice, for each attention layer in the decoder, we multiply the attention scores of the image tokens corresponding to $M$ by a scaling factor $\lambda$, directing the model’s focus toward the specific image region. The modified attention map $\hat{A}$ is formulated as:

$\hat{A} = \mathrm{softmax}\bigl(\log(\lambda) \cdot M + A\bigr)$,  (3)

where $M$ is the binary mask matrix and $A$ denotes the original (pre-softmax) attention score matrix. The attention reweighting strategy boosts the attention scores of specific tokens in the attention modules, effectively shifting the model’s focus to those tokens and enabling more detailed and precise control over the output. Because the bias $\log(\lambda)$ is added before the softmax in Equation 3, the unnormalized weight of each selected visual token is multiplied by a consistent factor of $\lambda$. Through the attention reweighting mechanism, the model pays greater attention to the answer region and thus generates accurate answers to the question.
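The reweighting of Equation 3 can be sketched as an additive bias of $\log(\lambda)$ on the pre-softmax scores of the visual tokens selected by $M$; applying it in every decoder attention layer and deriving `region_token_ids` by mapping the expanded bounding box onto the visual patch grid are assumptions consistent with the description above.

```python
import math
import torch
import torch.nn.functional as F

def reweight_attention(scores: torch.Tensor, region_token_ids, lam: float) -> torch.Tensor:
    """Eq. (3): boost the pre-softmax attention scores of the masked visual tokens.

    scores:            pre-softmax attention scores, shape (..., seq_len, seq_len)
    region_token_ids:  key-dimension indices of visual tokens inside the answer region
    lam:               scaling factor lambda
    """
    mask = torch.zeros(scores.size(-1), device=scores.device)  # binary mask M over key tokens
    mask[region_token_ids] = 1.0
    # Adding log(lambda) before the softmax multiplies the unnormalized
    # weight of each selected token by lambda.
    return F.softmax(scores + math.log(lam) * mask, dim=-1)
```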

TABLE I: Evaluations on comprehensive VLM benchmarks, including MME, MMBench and POPE.
Method | Projector | MME Perception | MME Cognition | MMBench Dev | MMBench Test | POPE Random | POPE Popular | POPE Adversarial | Sum
BLIP2-13B [30] | Q-former | 1293.8 | 290.0 | - | - | 89.6 | 85.5 | 80.9 | -
MiniGPT-4-7B [3] | Resampler | 581.7 | 144.3 | 23.0 | - | - | - | - | -
Shikra-13B [31] | Linear | - | - | 58.8 | - | - | - | - | -
Qwen-VL-chat-7B [32] | Resampler | 1487.5 | 360.8 | 60.6 | - | - | - | - | -
mPLUG-Owl2-7B [33] | Resampler | 1450.2 | - | 64.5 | - | - | - | - | -
LLaVA-v1.5-7B [4] | MLP | 1510.7 | 285.0 | 64.3 | 66.4 | 87.3 | 86.1 | 84.2 | 2184.0
LLaVA-v1.5-7B+CoF | MLP | 1515.9 | 334.0 | 65.1 | 67.7 | 87.4 | 86.3 | 84.2 | 2240.6 (+56.6)
LLaVA-v1.5-13B [4] | MLP | 1531.3 | 295.4 | 67.7 | 67.0 | 87.1 | 86.2 | 84.5 | 2219.2
LLaVA-v1.5-13B+CoF | MLP | 1545.6 | 310.7 | 68.3 | 69.2 | 88.0 | 86.6 | 84.8 | 2253.2 (+34.0)
InstructBLIP-13B [5] | Q-former | 1212.8 | 291.8 | 29.2 | 36.7 | 87.7 | 77.0 | 72.0 | 1807.2
InstructBLIP-13B+CoF | Q-former | 1236.1 | 290.8 | 29.2 | 50.6 | 87.5 | 84.8 | 82.5 | 1861.5 (+54.3)
TABLE II: Ablation studies on LLaVA-v1.5-13B model.
Reweight | Ground | MME Perception | MME Cognition | MMBench Dev | MMBench Test
✗ | ✗ | 1531.3 | 295.4 | 67.7 | 67.0
✓ | ✗ | 1527.2 | 306.8 | 68.5 | 69.1
✓ | ✓ | 1545.6 | 310.7 | 68.3 | 69.2

III Experiment

III-A Experimental Setup

We apply our CoF approach to the popular LLaVA-v1.5-7B, LLaVA-v1.5-13B [4] and InstructBLIP-13B [5]. For InstructBLIP-13B, which uses query-based mapping, we apply the attention reweighting method within the Q-former. These models are evaluated on benchmarks (i.e., MME [34], MMBench [35], POPE [36]) focusing on multi-modal understanding and hallucination tasks. MME measures both perception and cognition abilities on a total of 14 subtasks. MMBench is a systematically designed objective benchmark for a robust and holistic evaluation of MLLM. POPE is a polling-based query method for better evaluation of object hallucination. Regarding hyperparameter settings, the expansion hyper-parameter $\alpha$ is set to 1.3 and the scaling factor $\lambda$ to 2.0 for the LLaVA-v1.5-7B model. For the LLaVA-v1.5-13B model, $\alpha$ is set to 1.0 while $\lambda$ is adjusted to 4.5. For InstructBLIP-13B, $\alpha$ is set to 1.0 and $\lambda$ is set to 22.0. We performed all experiments on an NVIDIA A800 80GB GPU and a Tesla V100-SXM2 32GB GPU.
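For reference, these reported settings can be collected into a simple configuration mapping (the dictionary layout is illustrative, not the repository's actual configuration format):

```python
# Per-model hyper-parameters reported above: bounding-box expansion alpha
# and attention scaling factor lambda.
COF_HYPERPARAMS = {
    "LLaVA-v1.5-7B":    {"alpha": 1.3, "lambda": 2.0},
    "LLaVA-v1.5-13B":   {"alpha": 1.0, "lambda": 4.5},
    "InstructBLIP-13B": {"alpha": 1.0, "lambda": 22.0},
}
```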

III-B Experimental Results

Our primary experimental results are summarized in Table I. The data indicate a marked improvement in the performance of our approach across all three benchmarks, demonstrating its effectiveness. Specifically, the results on the MME and MMBench benchmarks highlight advancements in multi-modal understanding tasks. Furthermore, the enhancement observed on the POPE benchmark underscores the success of CoF in mitigating hallucinations in MLLM. This supports our hypothesis that enhancing the model’s grounding capabilities contributes to improved fine-grained understanding and reduces the effects of hallucinations. Significant gains were achieved on the MME perception task, which illustrates the superiority of our coarse-to-fine-grained perception approach over traditional fixed-granularity methods. These results indicate that the model can accurately identify the relevant answer areas, thereby streamlining inference over the original image and enhancing its overall performance.

III-C Ablation Study

In this section, we present comprehensive ablation experiments on the LLaVA-v1.5-13B model. The results are summarized in Table II. We first evaluate the efficacy of the attention reweighting mechanism by scaling the attention scores of all image tokens. Previous research indicates that as the length of the output sequence increases, the model’s reliance on visual prompts diminishes, resulting in a tendency to “forget” input image information [37]. Consequently, the model may over-rely on previously generated text sequences to predict answers, making it more susceptible to hallucinations due to text priors [37]. Compared with the baseline, the introduction of the attention reweighting mechanism emphasizes the importance of visual prompts in answer prediction, guiding the model to predict answers that are more faithful to the image, thereby improving performance on the benchmarks. However, this variant still processes image information at a fixed granularity and lacks local fine-grained visual attention guidance, making it easily affected by background noise.

To address this limitation, we introduced the location grounding stage to direct the model’s focus toward the relevant local context. Previous studies employed the grounding capability of MLLM to define the answer area by cropping and scaling it [38, 39], which often compromises the overall semantic integrity of the image. This is particularly detrimental when the model exhibits relatively weak grounding capabilities, as it can lead the model to predict incorrect answers. However, CoF successfully preserves the global semantic information of the image while simultaneously highlighting the targeted local area, thereby diminishing the occurrence of false positive predictions. Ultimately, CoF operates through the aforementioned two-stage approach, transitioning from coarse-grained to fine-grained image perception while retaining comprehensive semantic information, which contributes to achieving optimal performance.

IV Conclusion

In this paper, we proposed a step-by-step multi-modal understanding approach from coarse to fine (CoF). By decomposing the complex image understanding task into two stages, coarse-grained localization and fine-grained recognition, our CoF approach improved the overall performance on multiple downstream tasks, verifying its feasibility. In addition, its imitation of the human visual perception process makes CoF more interpretable. In the future, we will continue to study the attention mechanism underlying image-text interaction, further improve the fine-grained understanding ability of the model, and continue to explore the interpretability of MLLM.

References

  • [1] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  • [2] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna.lmsys.org (accessed 14 April 2023), vol. 2, no. 3, p. 6, 2023.
  • [3] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” 2023.
  • [4] H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 296–26 306.
  • [5] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” 2023. [Online]. Available: https://arxiv.org/abs/2305.06500
  • [6] Y. Ge, S. Zhao, Z. Zeng, Y. Ge, C. Li, X. Wang, and Y. Shan, “Making llama see and draw with seed tokenizer,” arXiv preprint arXiv:2310.01218, 2023.
  • [7] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
  • [8] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  • [9] D. Gao, L. Jin, B. Chen, M. Qiu, P. Li, Y. Wei, Y. Hu, and H. Wang, “Fashionbert: Text and image matching with adaptive loss for cross-modal retrieval,” in International ACM SIGIR Conference on Research and Development in Information Retrieval, Xi’an, China, 2020, pp. 2251–2260.
  • [10] H. Wei, L. Kong, J. Chen, L. Zhao, Z. Ge, J. Yang, J. Sun, C. Han, and X. Zhang, “Vary: Scaling up the vision vocabulary for large vision-language models,” arXiv preprint arXiv:2312.06109, 2023.
  • [11] Y. Yuan, W. Li, J. Liu, D. Tang, X. Luo, C. Qin, L. Zhang, and J. Zhu, “Osprey: Pixel understanding with visual instruction tuning,” in Conference on Computer Vision and Pattern Recognition, 2024, pp. 28 202–28 211.
  • [12] Y. Wang, D. Gao, L. Yi, L. Jin, J. Zhang, L. Yang, and X. Cai, “Enhancing fine-grained vision-language pretraining with negative augmented samples,” arXiv preprint arXiv:2412.10029, 2024.
  • [13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, 2021, pp. 8748–8763.
  • [14] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning, 2021, pp. 4904–4916.
  • [15] M. Zhuge, D. Gao, D.-P. Fan, L. Jin, B. Chen, H. Zhou, M. Qiu, and L. Shao, “Kaleido-bert: Vision-language pre-training on fashion domain,” in Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 647–12 657.
  • [16] L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu, “Filip: Fine-grained interactive language-image pre-training,” in International Conference on Learning Representations, 2021.
  • [17] H. Singh, P. Zhang, Q. Wang, M. Wang, W. Xiong, J. Du, and Y. Chen, “Coarse-to-fine contrastive learning in image-text-graph space for improved vision-language compositionality,” in Conference on Empirical Methods in Natural Language Processing, 2023.
  • [18] C. Chen, R. Qin, F. Luo, X. Mi, P. Li, M. Sun, and Y. Liu, “Position-enhanced visual instruction tuning for multimodal large language models,” arXiv preprint arXiv:2308.13437, 2023.
  • [19] S. Xuan, Q. Guo, M. Yang, and S. Zhang, “Pink: Unveiling the power of referential comprehension for multi-modal llms,” in Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 838–13 848.
  • [20] J. Jain, J. Yang, and H. Shi, “Vcoder: Versatile vision encoders for multimodal large language models,” in Conference on Computer Vision and Pattern Recognition, 2024, pp. 27 992–28 002.
  • [21] A. Shtedritski, C. Rupprecht, and A. Vedaldi, “What does clip know about a red circle? visual prompt engineering for vlms,” in International Conference on Computer Vision, 2023, pp. 11 987–11 997.
  • [22] L. Yang, Y. Wang, X. Li, X. Wang, and J. Yang, “Fine-grained visual prompting,” Neural Information Processing Systems, vol. 36, 2024.
  • [23] M. Cai, H. Liu, S. K. Mustikovela, G. P. Meyer, Y. Chai, D. Park, and Y. J. Lee, “Vip-llava: Making large multimodal models understand arbitrary visual prompts,” in Conference on Computer Vision and Pattern Recognition, 2024, pp. 12 914–12 923.
  • [24] J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao, “Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v,” arXiv preprint arXiv:2310.11441, 2023.
  • [25] Q. Guo, S. De Mello, H. Yin, W. Byeon, K. C. Cheung, Y. Yu, P. Luo, and S. Liu, “Regiongpt: Towards region understanding vision language model,” in Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 796–13 806.
  • [26] Y. Zhang, S. Qian, B. Peng, S. Liu, and J. Jia, “Prompt highlighter: Interactive control for multi-modal llms,” in Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 215–13 224.
  • [27] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Neural Information Processing Systems, vol. 35, pp. 23 716–23 736, 2022.
  • [28] S. Moon, A. Madotto, Z. Lin, T. Nagarajan, M. Smith, S. Jain, C.-F. Yeh, P. Murugesan, P. Heidari, Y. Liu et al., “Anymal: An efficient and scalable any-modality augmented language model,” arXiv preprint arXiv:2309.16058, 2023.
  • [29] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua, “Next-gpt: Any-to-any multimodal llm,” arXiv preprint arXiv:2309.05519, 2023.
  • [30] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in International Conference on Machine Learning.   PMLR, 2023, pp. 19 730–19 742.
  • [31] K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao, “Shikra: Unleashing multimodal llm’s referential dialogue magic,” arXiv preprint arXiv:2306.15195, 2023.
  • [32] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,” arXiv preprint arXiv:2308.12966, 2023.
  • [33] Q. Ye, H. Xu, J. Ye, M. Yan, A. Hu, H. Liu, Q. Qian, J. Zhang, and F. Huang, “mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration,” in Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 040–13 051.
  • [34] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y. Wu, and R. Ji, “Mme: A comprehensive evaluation benchmark for multimodal large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2306.13394
  • [35] Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu et al., “Mmbench: Is your multi-modal model an all-around player?” arXiv preprint arXiv:2307.06281, 2023.
  • [36] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen, “Evaluating object hallucination in large vision-language models,” arXiv preprint arXiv:2305.10355, 2023.
  • [37] A. Favero, L. Zancato, M. Trager, S. Choudhary, P. Perera, A. Achille, A. Swaminathan, and S. Soatto, “Multi-modal hallucination control by visual information grounding,” in Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 303–14 312.
  • [38] H. Shao, S. Qian, H. Xiao, G. Song, Z. Zong, L. Wang, Y. Liu, and H. Li, “Visual cot: Unleashing chain-of-thought reasoning in multi-modal language models,” arXiv preprint arXiv:2403.16999, 2024.
  • [39] B. Luan, H. Feng, H. Chen, Y. Wang, W. Zhou, and H. Li, “Textcot: Zoom in for enhanced multimodal text-rich image understanding,” arXiv preprint arXiv:2404.09797, 2024.