
Abstract

Humans perform visual perception at multiple levels, including low-level object recognition and high-level semantic interpretation such as behavior understanding. Subtle differences in low-level details can lead to substantial changes in humans' high-level perception. For example, substituting the shopping bag held by a person with a gun suggests violent behavior, implying possible criminal involvement. Despite significant advancements in various multimodal tasks, the capability of Large Vision-Language Models (LVLMs) to conduct such multi-level visual perception remains unexplored. To investigate the perception gap between LVLMs and humans, we introduce MVP-Bench, the first visual-language benchmark systematically evaluating both the low- and high-level visual perception capabilities of LVLMs. We construct MVP-Bench across natural and synthetic images to investigate how manipulated content influences model perception. Using MVP-Bench, we diagnose the visual perception of 10 open-source and 2 closed-source LVLMs. Results show that high-level perception tasks significantly challenge existing LVLMs: the state-of-the-art GPT-4o only achieves an accuracy of 76.09%, compared with 86.85% in the low-level scenario. Furthermore, the performance gap between natural and manipulated images indicates that current LVLMs do not generalize to understanding the visual semantics of synthetic images as humans do.

1 Introduction

Visual perception (VP) refers to the ability to transform visual signals into meaningful perceptions DEWIT2012665; gordon2019intermodulation. When humans parse visual signals, they initially engage in high-level perception to grasp the overarching concept using commonsense knowledge. This serves as contextual guidance for exploring further low-level details aligned with their intentions wang2024browse; garner1987metacognition. For example, given an image of a man in a bar, humans first grasp the high-level concept, such as the behavior of drinking, and then focus on low-level details, such as the type of alcohol, to obtain specific information. Existing Large Vision-Language Models (LVLMs) demonstrate an exceptional understanding of such low-level visual clues. However, it remains unexplored whether they possess a similar hierarchical visual perception at both levels, as humans do.

Refer to caption
Figure 1: A sample of MVP-Bench, which contains both high- and low-level visual perception. Image 1 and Image 2 form an image pair. Their different backgrounds indicate that the man is engaged in different behaviours.

Recently, several benchmarking works have considered visual perception in their evaluation liu2023mmbench; fu2024mme. However, as holistic evaluation benchmarks, they lack specialization in assessing visual perception. Specifically, most of their tasks focus on low-level perception, such as Counting or Existence Detection questions on single images. Moreover, existing benchmarks are mostly designed around individual question-image samples, failing to evaluate the consistency and accuracy of understanding an image through different types of perception. Furthermore, most current benchmarks are built on real-world natural image data, making it hard to disentangle the reliance on prior knowledge from the visual perception of specific contexts, such as synthetic images bitton2023breaking. Motivated by the challenges of interpreting the visual perception capabilities of LVLMs, we propose MVP-Bench, the first benchmark systematically evaluating the multi-level visual perception of LVLMs. We thoroughly design 5 high-level and 13 low-level perception categories, which we detail in Section 3. Furthermore, we construct natural-synthetic image pairs that convey contrasting perceptions as a more challenging visual perception task. As shown in Figure 1, each sample is accompanied by questions at both levels.

In summary, we introduce MVP-Bench, the first benchmark systematically evaluating the multi-level visual perception of LVLMs. We evaluate 12 LVLMs and find that high-level visual perception is more challenging than low-level perception, and that LVLMs tend to perform worse on manipulated images than on natural images, which we attribute to the distribution of their training data. Our further qualitative analysis reveals current LVLMs' deficiencies and the gap between open- and closed-source models.

2 Related Work

Visual Perception.

Visual Perception represents how the human brain transforms the pattern of information on the retina into a meaningful perception of the world DEWIT2012665; cornsweet2012visual. This process involves interactions among sensory and cognitive processes across hierarchical levels in the brain gordon2019intermodulation; ROUW1997209. Low-level visual features refer to properties like colors and spatial attributes, while high-level visual processing is integrated with human cognitive functions (e.g., commonsense knowledge, personal experiences) related to the recognized objects akcelik2022influence; wu2023q; 1180642113; SCHINDLER202114. Both perception competences are crucial, as human visual perception begins with grasping the image's main idea at a high level and then delves into low-level features motivated by particular intentions garner1987metacognition. In MVP-Bench, we define five high-level categories and thirteen low-level categories. The mapping relationships between levels indicate which low-level features can support each high-level perception (illustrated in Section 3).

Vision-Language Benchmarks.

Some recent benchmarks contain visual perception as a section, but their aim of offering a comprehensive evaluation of LVLMs' various capabilities leads to an inadequate exploration of visual perception across different levels. MMBench liu2023mmbench and MME fu2024mme categorize visual perception based on question granularity. Although their coarse perception questions are general, questions like counting or existence detection cannot reflect an image's main idea. Additionally, they evaluate different categories of visual perception individually, making it impossible to compare an LVLM's perception across levels. The definition of perception in PCA-Bench chen2024pca resembles ours, emphasizing how perception offers guiding context in decision-making domains. However, their images, which mainly depict environments, normally do not require significant high-level perception. MVP-Bench systematically evaluates LVLMs' multi-level visual perception, with each image accompanied by both high-level and low-level questions. To ensure each case requires both high- and low-level perception, we construct image pairs containing humans peng2023agenda; thomson2022visual.

Synthetic Images.

Recent advancements in image generation tools ramesh2021zero; rombach2021highresolution and image editing models brooks2023instructpix2pix; zhang2023sine have led to synthetic datasets for different tasks, such as Whoops bitton2023breaking and StableRep tian2024stablerep. When utilizing text-to-image tools to generate synthetic images, a prompt aligned with the expected image content is essential. In previous works, such prompts come from manual crafting bitton2023breaking, text annotations in existing datasets tian2024stablerep, or generation by ChatGPT aboutalebi2024magid; li2023stablellava; wu2023visual. In MVP-Bench, we generate manipulated images for constructing image pairs. To obtain a prompt tailored to each case and minimize human labor, we employ ChatGPT to generate the manipulation instructions. Further details are provided in Section 4.1.

3 MVP-Bench Evaluation Suite

MVP-Bench comprises 530 natural-manipulated image pairs accompanied by questions at multiple perception levels. Using MVP-Bench, we diagnose LVLMs by investigating (1) the performance gap between high- and low-level visual perception and (2) the difference in visual understanding abilities on natural and manipulated images.

3.1 Evaluation across Perception Levels

This section details the definitions of multi-level visual perception in MVP-Bench. We prioritize the perception of humans as the subject of high-level perception, e.g., in misinformation understanding da2021edited and emotion recognition hari2009brain, where high-level perception is commonly engaged.

We categorize high-level (L-H) perceptions of humans into five dimensions: Behaviour, Role, Identity, Emotion, and Scenario. Each dimension corresponds to several low-level (L-L) perception types. As shown in Figure 3 (a), certain low-level perceptions (e.g., attire such as a police uniform or group association with firefighters) can support a high-level perception (e.g., Role).

We design Yes/No questions and Cross-Image questions at both levels. Constructed on the same set of images, the multi-level perception tasks enable us to diagnose the perception gap in LVLMs across different levels. Specifically, we calculate the accuracy on Yes/No questions based on the correctness of each individual question-image pair (denoted aAcc), while all multiple-choice questions within MVP-Bench are evaluated with the Circular Strategy liu2023mmbench to alleviate model prediction bias from the option order.
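As a concrete illustration, below is a minimal sketch of the Circular Strategy (our reading of liu2023mmbench, not the official implementation): a multiple-choice question counts as correct only if the model selects the ground-truth option under every circular shift of the option order. The predict callable standing in for an LVLM is hypothetical.

from typing import Callable, List

def circular_eval(predict: Callable[[str], str], question: str,
                  options: List[str], answer_idx: int) -> bool:
    # Count the question as correct only if the ground-truth option is chosen
    # under every circular rotation of the option order.
    letters = "ABCD"
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        gt_letter = letters[(answer_idx - shift) % n]  # where the ground truth lands after rotation
        prompt = question + "\n" + "\n".join(
            f"{letters[i]}. {opt}" for i, opt in enumerate(rotated))
        if predict(prompt).strip().upper()[:1] != gt_letter:
            return False
    return True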

3.2 Evaluation with Image Pairs

Each natural-manipulated image pair within MVP-Bench conveys significantly different multi-level perceptions. Specifically, the two images differ only in one of the L-L perception categories (in Figure 3 (a)), leading to distinct L-H perceptions. To mitigate the effect of LVLMs' answer tendencies on Yes/No questions liu2023hallusionbench, we examine whether LVLMs elicit different perceptions when given an image pair with the same question. We further explore the performance gap of LVLMs on natural and manipulated images in Section 5.

For Yes/No questions, we ask the same question on pairwise image data. As the two images are manipulated to convey different perceptions, they have opposite ground-truth answers. We calculate qAcc and iAcc based on question-level and image-level accuracy, respectively, following liu2023hallusionbench. We also design a holistic metric, mAcc, which requires answering all questions corresponding to an image pair correctly.
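The aggregation can be summarised with the following minimal sketch (our interpretation of aAcc, qAcc, iAcc, and mAcc from the definitions above and liu2023hallusionbench; field names are illustrative):

from collections import defaultdict

def yes_no_metrics(records):
    # records: dicts with keys pair_id, image_id, question_id, correct (bool),
    # one entry per (question, image) evaluation.
    by_question = defaultdict(list)  # same question asked on both images of a pair
    by_image = defaultdict(list)     # all questions asked on one image
    by_pair = defaultdict(list)      # all questions asked on one image pair
    for r in records:
        by_question[(r["pair_id"], r["question_id"])].append(r["correct"])
        by_image[r["image_id"]].append(r["correct"])
        by_pair[r["pair_id"]].append(r["correct"])
    aAcc = sum(r["correct"] for r in records) / len(records)             # per question-image pair
    qAcc = sum(all(v) for v in by_question.values()) / len(by_question)  # question-level
    iAcc = sum(all(v) for v in by_image.values()) / len(by_image)        # image-level
    mAcc = sum(all(v) for v in by_pair.values()) / len(by_pair)          # whole image pair
    return {"aAcc": aAcc, "qAcc": qAcc, "iAcc": iAcc, "mAcc": mAcc}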

For single-image multiple-choice questions, we focus on model understanding of manipulated images as a more challenging task. We include the answer to the natural image as a distractor to assess LVLMs' ability to discern the differences within the image pair. Additionally, we leverage ChatGPT (gpt-3.5-turbo-1106) to generate three other options aligned with the low-level clues in the manipulated image to increase the task difficulty.

4 MVP-Bench Construction

In this section, we present the construction process of image pairs and the collection of corresponding questions at both perception levels.

4.1 Construction Pipeline

We select images from the EMU dataset da2021edited as natural images for constructing image pairs. EMU focuses on visual misinformation, portraying cases involving humans and complex social scenes that require perceptions at both levels. Based on the natural image, we generate synthetic manipulations following one of the L-L categories.

However, when altering a manipulated image's L-H perception in a certain category, it is challenging to constrain the manipulation to exactly one specific L-L category without significantly modifying other details. It is also hard to ensure consistency between the image pairs and the questions. We propose a three-step benchmark construction pipeline to meet these two requirements.

Refer to caption
Figure 2: The MVP-Bench construction pipeline consists of three steps. Step 1 uses three categories (‘Behaviour-Background’, ‘Role-Clothes’, ‘Emotion-Facial Expression’) as examples to illustrate how high-level perception guides the identification of low-level perception. Step 2 demonstrates three categories of manipulated image generation: Overall Background Substitution, Partial Component Substitution, and Direct Alteration (from left to right). Step 3 explains how to generate questions based on the ideas obtained in Step 1, with the same colour indicating that the generated question is based on the corresponding part from the expected perception.

Step one: Idea Generation.

We utilize ChatGPT to generate ideas on how to manipulate natural images via Chain-of-Thought (CoT) prompting. Given an initially determined L-H category, we prompt ChatGPT to identify a corresponding low-level perception to support it. For instance, in Figure 2, for the “Behaviour-Background Substitution” category, ChatGPT first generates an idea to change the woman's behaviour from attending a party to engaging in an experiment. Under this guidance, the background of the manipulated image should be a laboratory environment. In addition, we obtain auxiliary information such as a description of the manipulated image, which is incorporated into the textual prompt for image generation in Step 2.

To ensure coherence between the generated idea and the subsequent visual editing, we fix a specific subject at this initial step by utilizing the visual grounding ability of Shikra (chen2023shikra). Specifically, we employ Shikra to retrieve the coordinates of the selected subject (C_{sub}) and use them to query low-level features (e.g., “What is the man holding?”) from the image in the subsequent steps.
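A minimal sketch of the ChatGPT side of Step 1 is given below. The prompt wording, the helper arguments, and the output format are assumptions for illustration; the exact MVP-Bench prompt is not reproduced here, and the Shikra grounding call is abstracted away.

from openai import OpenAI

client = OpenAI()

# Hypothetical CoT prompt; the wording used to build MVP-Bench may differ.
IDEA_PROMPT = (
    "We want to manipulate an image so that its high-level perception changes.\n"
    "High-level category: {hl}\nLow-level category to edit: {ll}\n"
    "Caption of the natural image: {caption}\n"
    "Think step by step: (1) state the current high-level perception, "
    "(2) propose a contrasting high-level perception, "
    "(3) give a caption of the manipulated image and the concrete low-level edit that realises it."
)

def generate_manipulation_idea(hl: str, ll: str, caption: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[{"role": "user",
                   "content": IDEA_PROMPT.format(hl=hl, ll=ll, caption=caption)}],
    )
    return resp.choices[0].message.content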

Step two: Manipulated Image Generation.

We define three categories of manipulated image generation based on the image-editing type: Overall Background Substitution, Partial Component Substitution, and Direct Alteration.

1. Partial Component Substitution. Partial Component Substitution refers to manipulating an image by substituting an object or a part of the main subject. The pipeline utilizes Shikra to extract the target object's coordinates (C_{obj}), with C_{sub} serving as a constraint. After masking out C_{obj}, we apply Stable-Diffusion-Inpaint stacchio2023stableinpainting, using the manipulated image's caption obtained from Step 1 as the prompt to generate a manipulated image. The L-L categories {B_2, B_3, B_4, R_2, I_1, I_2, I_3, E_1} can be handled by this process.
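A minimal sketch of this inpainting step with the diffusers library is shown below; the checkpoint name, file paths, and prompt are placeholders, and the Shikra-derived mask for C_{obj} is assumed to be available as an image file.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Checkpoint is an assumption; the paper only names Stable-Diffusion-Inpaint.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

natural = Image.open("natural.jpg").convert("RGB").resize((512, 512))
obj_mask = Image.open("c_obj_mask.png").convert("L").resize((512, 512))  # white = region to regenerate

manipulated = pipe(
    prompt="a man holding a gun in a shopping mall",  # caption of the manipulated image from Step 1
    image=natural,
    mask_image=obj_mask,
).images[0]
manipulated.save("manipulated.jpg")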

Refer to caption
Figure 3: MVP-Bench statistics. (a) shows 5 high-level (L-H) categories and 13 low-level (L-L) categories, where the mapping relationships indicate which low-level features can support a certain high-level perception. (b) shows the distribution of questions. Y/N, CI, and MCQ denote Yes/No questions, cross-image questions, and single-image multiple-choice questions, respectively. (c) demonstrates the distribution of images with questions at different levels.

2. Overall Background Substitution. Overall Background Substitution generates a manipulated image by retaining solely the main subject while replacing the entire background. In these cases, a standard rectangular mask cannot exactly cover the subject, potentially retaining unexpected elements and distorting the background generation. To address this limitation, we employ the Segment Anything Model to produce a set of detected object masks (\mathbb{M} = \{M_1, M_2, \ldots, M_n\}) in irregular shapes for a given image. We identify the mask with the greatest overlap with C_{sub}:

mask = \arg\max_{M_i \in \mathbb{M}} \mathrm{Overlap}(M_i, C_{sub})    (1)

Here, Overlap refers to a function that calculates the overlapping area between two regions. To enhance flexibility and increase the case difficulty, we randomly translate C_{sub}, rescale C_{sub}, and resize the entire mask. Finally, with the new mask and the manipulated image's caption obtained from Step 1, we utilize Stable-Diffusion-Inpaint to generate a new image with a different background from the original natural image. This process handles {B_1, R_1, S_1}.
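Below is a minimal sketch of the mask selection in Eq. (1) using the segment-anything package; the checkpoint path and the bounding-box format of C_{sub} are assumptions.

import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

def pick_subject_mask(image_rgb: np.ndarray, c_sub: tuple) -> np.ndarray:
    """Return the SAM mask with the greatest overlap with C_sub = (x1, y1, x2, y2), as in Eq. (1)."""
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder checkpoint
    masks = SamAutomaticMaskGenerator(sam).generate(image_rgb)

    x1, y1, x2, y2 = c_sub
    subject_box = np.zeros(image_rgb.shape[:2], dtype=bool)
    subject_box[y1:y2, x1:x2] = True

    # Overlap = number of pixels shared by the candidate mask and the subject box.
    overlaps = [np.logical_and(m["segmentation"], subject_box).sum() for m in masks]
    return masks[int(np.argmax(overlaps))]["segmentation"]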

3. Direct Alteration. Direct Alteration addresses situations where nothing can be substituted, yet some alteration is necessary, such as changing facial expressions. With the original natural image and the manipulation instruction obtained from Step 1, we directly utilize the image-editing model InstructPix2Pix brooks2023instructpix2pix to generate a manipulated image for {E_2, S_2}. However, since this process cannot focus on specific subjects, we mainly apply it to images containing a single person or cases requiring overall manipulations.
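A minimal sketch of this editing step with the diffusers InstructPix2Pix pipeline follows; the instruction text and guidance settings are illustrative, not the exact values used for MVP-Bench.

import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

natural = Image.open("natural.jpg").convert("RGB")
manipulated = pipe(
    "make the man look angry",       # manipulation instruction from Step 1
    image=natural,
    num_inference_steps=20,
    image_guidance_scale=1.5,        # higher values keep the output closer to the input image
).images[0]
manipulated.save("manipulated.jpg")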

Step three: Visual Question Generation.

We generate Yes/No questions as well as single- and cross-image multiple-choice questions using ChatGPT based on the ideas generated in Step 1. Single-image questions exploit the discrepancy within an image pair by using the natural image's answer as a distractor, while cross-image questions directly ask about the differences between the two images of a pair. To ensure the quality of generated questions, two of this paper's authors manually verified all 3205 questions. A question was retained only when both annotators accepted it; 1872 questions are retained in MVP-Bench. While verifying Yes/No questions, we focused on (1) the quality of the manipulation and (2) the consistency between images and ground truths. For multiple-choice questions, we paid additional attention to cases where distractors were not sufficiently distinct from the ground truth. We manually adjusted these distractors and double-checked the cases to ensure both annotators accepted them.

4.2 MVP-Bench Statistics

We retain 1105 high-level questions, including 460 Yes/No questions, 418 single-image multiple-choice questions (MCQ), and 227 cross-image multiple-choice questions (CI). Additionally, we have 767 low-level questions, comprising 540 Yes/No questions and 227 CI questions (shown in Figure 3). Out of 530 image pairs, 329 are accompanied by questions at both the high and low levels, while 193 pairs only feature an individual high-level MCQ.

5 Experiment Results

Models | Single-Image: qAcc (L-L, L-H, L-M), aAcc (L-L, L-H, L-M), mAcc (L-M) | Cross-Image: CircularEval (L-L, L-H), VanillaEval (L-L, L-H)
DeepSeek (1.3B) 63.33 53.04 58.60 81.48 75.87 78.90 28.40 19.38 18.94 40.97 29.07
MiniCPM-2 (3B) 68.52 55.22 62.40 84.07 76.30 80.50 34.91 29.51 11.45 43.61 31.72
DeepSeek (7B) 70.00 54.35 62.80 84.82 76.09 80.00 33.73 36.12 25.99 47.58 36.56
InstructBLIP (7B) 49.63 40.00 45.20 74.82 69.13 72.20 17.75 0.00 1.32 27.31 23.79
LLaVA-1.5 (7B) 68.89 51.74 61.00 84.45 75.44 80.30 31.36 20.26 14.10 39.21 26.87
MiniGPT4 (8.2B) 14.44 8.26 11.60 39.26 33.70 36.7 0.59 0.00 0.00 2.64 5.73
MiniGPT4-v2 (8.2B) 52.59 40.87 47.20 73.70 67.40 70.80 14.20 0.00 0.00 21.59 24.67
mPLUG-Owl2 (8.2B) 69.26 54.78 62.60 84.63 76.30 80.80 36.09 21.14 13.22 34.80 25.99
InstructBLIP (13B) 50.37 36.09 43.80 75.19 67.61 71.70 15.98 1.76 0.44 25.99 18.50
LLaVA-1.5 (13B) 66.67 52.17 60.00 83.34 76.09 80.00 28.40 25.99 18.06 41.85 32.60
GPT-4V 66.30 39.57 54.00 82.23 69.13 76.2 23.08 44.50 14.10 63.00 37.44
GPT-4o 74.44 56.09 66.00 86.85 76.09 81.9 39.05 74.01 34.80 87.22 51.54
Table 1: High-level visual perception evaluation results. L-L, L-H, and L-M denote the low-, high-, and multi-level results, respectively. CircularEval shows the results of evaluation with the circular strategy, while VanillaEval denotes the results when each question is directly evaluated once. We highlight the model with the highest performance on each metric.
Method | Yes/No: iAcc (N, M, Both), aAcc (N, M, Both), mAcc (Both) | MCQ: CircularEval (Both), VanillaEval (Both)
DeepSeek (1.3B) 60.95 44.38 52.66 83.20 74.60 78.90 28.40 43.78 62.44
MiniCPM-2 (3B) 68.64 53.85 61.24 85.20 75.80 80.50 34.91 44.74 62.20
DeepSeek (7B) 68.05 52.07 60.06 85.00 76.60 80.80 33.73 59.33 74.40
InstructBLIP (7B) 44.38 44.97 44.68 72.40 72.00 72.20 17.75 4.07 19.14
LLaVA-1.5 (7B) 64.50 52.66 58.58 83.20 77.40 80.30 31.36 57.18 71.29
MiniGPT4 (8.2B) 10.06 4.73 7.40 41.80 31.60 36.70 0.59 0.00 2.63
MiniGPT-v2 (8.2B) 53.85 31.95 42.90 79.60 62.00 70.80 14.20 1.91 29.43
mPLUG-Owl2 (8.2B) 66.27 54.44 60.36 84.20 77.40 80.80 36.09 50.72 67.70
InstructBLIP (13B) 41.42 46.15 43.79 70.60 72.80 71.70 15.98 3.83 11.96
LLaVA-1.5 (13B) 58.58 55.62 57.1 81.20 78.80 80.00 28.40 55.02 72.25
GPT-4V 71.07 30.77 50.92 87.80 65.98 76.20 23.08 59.81 72.25
GPT-4o 76.92 48.52 62.72 90.00 73.80 81.90 39.05 64.83 77.27
Table 2: Low-level visual perception results. N, M, and Both denote the results on natural, manipulated, and all images, respectively. CircularEval shows the results of evaluation with the circular strategy, while VanillaEval denotes the results of direct evaluation. We highlight the model with the highest performance on each metric.

5.1 Models

We use MVP-Bench to diagnose and compare the visual perception capabilities of various LVLMs belonging to two categories: (1) Open-Source LVLMs including MiniCPM-V-2 (openbmb2024), DeepSeek-VL (lu2024deepseekvl), MiniGPT4 (zhu2023minigpt4), mPLUG-Owl2 (ye2023mplugowl2), InstructBLIP (dai2023instructblip), and LLaVA-1.5 (liu2023improved); (2) Proprietary LVLMs including GPT-4V and GPT-4o. All the experiments are conducted with VLMEvalKit (2023opencompass) under the zero-shot setting for a fair comparison.

5.2 Result Analysis

As outlined in Section 3, we compare the performance of LVLMs at multiple perception levels (Table 1). We also investigate the performance variation when given manipulated images in Table 2.

Performance at Different Perception Levels.

As shown in Table 1, both open- and closed-source models perform worse on high-level perception tasks than on low-level ones, e.g., a qAcc of 55.22%, 52.17%, and 56.09% at the high level compared to 68.52%, 66.67%, and 74.44% at the low level for MiniCPM-V-2, LLaVA-1.5-13B, and GPT-4o, respectively. The performance gaps demonstrate the challenge of high-level visual perception tasks. Specifically, we observe that closed-source models present a larger relative performance gap between high-level and low-level perception. For example, GPT-4o achieves an accuracy of 34.80% (reduced by 52.98% from 74.01%) on cross-image MCQ, compared to 18.06% (reduced by 29.74% from 25.99%) for LLaVA-1.5-13B. This indicates that the performance gains of closed models mainly come from their superior low-level perception, yet they still encounter challenges in high-level tasks. We further discuss the potential cause of this observation in Section 5.3.

Impact of Model Sizes.

Smaller models can outperform larger ones in Table 1. Among open-source models, MiniCPM-V-2-3B and DeepSeek-VL-7B achieve the best performance on high-level and low-level tasks, respectively. As MiniCPM-V-2 is aligned with fine-grained correctional human feedback, it shows excellent trustworthiness and reduced hallucination. This implies that LVLMs' trustworthiness may benefit their high-level visual perception. DeepSeek-VL demonstrates a strong capability of perceiving specific details with additional visual encoders for processing low-level features, indicating that these features are crucial to low-level visual perception. Besides, comparing LLaVA and InstructBLIP at different sizes reveals that increasing parameters from 7B to 13B does not notably enhance their visual perception at either level. Therefore, to enhance LVLMs' single-image visual perception, focusing on their ability to provide trustworthy answers and capture low-level features is more effective than simply scaling up.

Analysis on the Cross-Image Task.

Table 1 shows that closed-source models significantly surpass open-source models on the cross-image task, especially at the low perception level. For instance, GPT-4V and GPT-4o achieve accuracies of 44.50% and 74.01%, respectively, at the low level, significantly surpassing the accuracy of LLaVA-1.5-13B (25.99%). Furthermore, this performance gap is larger than that observed in the single-image task. Specifically, on the cross-image task, GPT-4o outperforms LLaVA-1.5-13B by 92.69% and 184.76% at the two levels, compared to just 7.5% and 11.65% on single-image tasks. This significant gap indicates open-source LVLMs' insufficient contextual attention, likely due to a lack of cross-image data during training.

Comparison between Natural and Manipulated Images.

As shown in Table 2, both open- and closed-source models show inferior performance on manipulated images compared to natural images. For example, MiniCPM-V-2, LLaVA-1.5-13B, and GPT-4o achieve an iAcc of 68.64%, 58.58%, and 76.92% on natural images, while exhibiting a lower iAcc of 53.85%, 55.62%, and 48.52% on manipulated images. We attribute this observation to the discrepancy between the visual perception of manipulated images and LVLMs' training data. Besides, closed-source models demonstrate a larger performance gap across image pairs than open-source models: the iAcc gaps of GPT-4V and GPT-4o are 40.3% and 28.4%, respectively, while LLaVA-1.5-13B and MiniCPM-V-2 have gaps of only 2.96% and 14.79%. One reason for this is the rigorous manner of GPT-4V and GPT-4o (see Section 5.3). Unlike humans, who focus only on relevant elements, these models scrutinize all details equally against their prior knowledge. Their tendency to provide critical answers impedes the visual perception of manipulated images.

Yes/No vs. MCQ.

GPT-4V and GPT-4o present conflicting results on the two tasks. Although both tasks are based on the manipulated images, the two models perform poorly on the Yes/No task, with an iAcc of 30.77% and 48.52%, while outperforming all open-source models on the MCQ task. From Table 2, we observe that the MCQ results and the iAcc on natural images share the same trend, which suggests that the closed-source models' inferior performance on manipulated images stems from the nature of Yes/No questions. As an open-ended generative task, Yes/No answering leads these models to respond rigorously and safely, while the MCQ task is less influenced by this rigorous manner. This is also a motivation for us to design both tasks for single-image perception.

5.3 Discussion

In this section, we present our qualitative analysis observations, investigating the poor performance of GPT-4V on Yes/No questions, the gap between open-source and closed-source models, and the deficiencies of current LVLMs.

Rigorous Behaviors of GPT-4V in Performing High-Level Perception.

Although GPT-4V exhibits the highest level of safety among current LVLMs, its rigorous manner may hinder the straightforward perception of common visual content. Specifically, GPT-4V usually approves only what it can directly observe from the image. It tends to refuse to interpret uncertain cases, such as conducting high-level perception without explicit visual clues. For example, as shown in Figure 4 (a), although GPT-4V accurately identifies the woman's attire as a doctor's uniform at the low perception level, it declines to provide the correct high-level perception that the woman is a doctor, as this cannot be directly observed in the image. This problem has been mitigated in GPT-4o, which gives the correct answer.

To explore whether we can motivate GPT-4V to integrate commonsense knowledge by tuning the prompt, we add the following instruction:

You are a helpful visio-linguistic AI assistant who answers questions in short words or phrases on visual commonsense in the images.

As shown in Table 3, we observe a significant performance improvement on high-level Yes/No tasks for both GPT-4V and GPT-4o, while the performance changes for open-source models such as DeepSeek-VL-7B and LLaVA-1.5-7B are negligible. This implies that commonsense knowledge is essential for reasonable high-level perception, and that specific prompt designs are important to elicit this commonsense reasoning ability from closed-source models.
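For the closed-source models, a minimal sketch of how such an instruction can be attached as a system message via the OpenAI chat API is shown below; the open-source models use their own chat templates, and this wrapper is an illustration rather than the exact evaluation code.

import base64
from openai import OpenAI

client = OpenAI()
VC_INSTRUCTION = ("You are a helpful visio-linguistic AI assistant who answers questions in "
                  "short words or phrases on visual commonsense in the images.")

def ask(image_path: str, question: str, use_vc: bool = True) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    messages = ([{"role": "system", "content": VC_INSTRUCTION}] if use_vc else []) + [
        {"role": "user", "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}
    ]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content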

Method | High-Level | Low-Level
DeepSeek-VL (7B) 54.35 70.00
DeepSeek-VL (7B) + VC 54.35 70.00
Δ 0 0
LLaVA-1.5 (7B) 51.74 68.89
LLaVA-1.5 (7B) + VC 53.48 69.26
Δ +1.74 +0.37
GPT-4V 39.57 66.30
GPT-4V + VC 43.91 64.81
Δ +4.34 -1.49
GPT-4o 56.09 74.44
GPT-4o + VC 58.70 75.19
Δ +2.61 +0.75
Table 3: The effect of adding the instruction to the prompt on Yes/No questions. VC denotes adding the instruction encouraging LVLMs to use commonsense. Δ denotes the change in qAcc after adding the instruction.
Refer to caption
Figure 4: Case study. We highlight the incorrect and correct parts of the answers.

Gaps between Open- and Closed-source LVLMs in Recognizing Visual Details and Utilizing Commonsense Knowledge.

Although LLaVA-1.5-13B and DeepSeek-VL-7B can outperform GPT-4o on straightforward content like background (qAcc of 92.42% and 86.36% compared to 81.82%; Table 4 in the Appendix reports models' performance on different categories of visual perception), they perform worse on object association perception, which requires recognizing details (qAcc of 50.00% and 58.93% compared to 66.07%), and on gesture perception, which requires commonsense knowledge (qAcc of 36.84% and 31.58% compared to 59.46%). For instance, in Figure 4, LLaVA-1.5-13B and DeepSeek-7B respectively fail to detect the gun held by the elderly man (left) and the emotion of the man (right), while GPT-4V and GPT-4o successfully identify both.

Bias in LVLMs to Prioritize Dominant Components.

One hard case in MVP-Bench requires LVLMs to comprehend an entire image based on an inconspicuous object. In Figure 4 (bottom left), all LVLMs prioritize the shopping mall setting while overlooking the gun held by the woman. We attribute this to the homogeneity of the training images, i.e., most training data consists of real-world images where a shopping mall closely correlates with shopping activities, misleading the models into ignoring the presence of the gun.

Bias in GPT-4V and GPT-4o to Perceive Scenes as Staged Performances.

GPT-4V and GPT-4o tend to interpret uncommon or dramatic scenes as staged images, especially when the co-occurrence frequency of the visual elements is low according to commonsense knowledge. For example, in Figure 4 (bottom right), the case depicts the woman behind using a gun to attack another woman, while GPT-4V and GPT-4o regard this as a staged security training exercise. This suggests that GPT-4V and GPT-4o over-rely on prior commonsense knowledge, which potentially obstructs their generalizability in understanding and interpreting uncommon scenes and their inherent semantic meanings.

6 Conclusion

We introduce MVP-Bench, the first benchmark systematically evaluating LVLMs' multi-level visual perception. We diagnose 12 current LVLMs and compare their performance across perception levels and between natural and manipulated images. Further analysis demonstrates these models' deficiencies and the gap between closed- and open-source models. We envision follow-up work enhancing LVLMs' ability to generate multi-level visual perceptions consistent with visual content.

Limitations

While constructing MVP-Bench, we generate manipulated images with diffusion models. Although we manually filtered out generated images that did not convey a different perception from the source natural images, some still contain blur, inconsistencies, or distortions (e.g., three-armed persons or blurred, distorted faces), potentially affecting LVLMs' understanding due to the introduced noise. Besides, MVP-Bench focuses on human-related visual perception to ensure each case necessitates multi-level understanding, potentially overlooking scenarios devoid of humans. In future work, we will refine and expand MVP-Bench to enhance image quality and topic coverage.

Ethics Statement

MVP-Bench contains violent content and celebrity information, which may cause harmful imitation or misinformation. To prevent the misuse of MVP-Bench, we will implement stringent access rules and consistently track follow-up works to ensure their research-only objectives.

Appendix A Cases of our definition of high- and low-level visual perception in MVP-Bench

We define 5 high-level categories and 13 low-level categories for visual perception in MVP-Bench. Here are more cases from MVP-Bench for each category.

Refer to caption
Figure 5: Cases for ‘Behaviour-Background’ and ‘Behaviour-Movement’ categories.
Refer to caption
Figure 6: Cases for ‘Behaviour-Object Association’ and ‘Behaviour-Content’ categories.
Refer to caption
Figure 7: Cases for ‘Role-Attire’ and ‘Role-Group Association’ categories.
Refer to caption
Figure 8: Cases for ‘Role-Virtual Character’ and ‘Identity-Physical Feature’ categories.
Refer to caption
Figure 9: Cases for ‘Identity-Celebrity’ and ‘Identity-Gesture’ categories.
Refer to caption
Figure 10: Cases for ‘Emotion-Facial Expression’ and ‘Scenario-Background’ categories.
Refer to caption
Figure 11: Cases for ‘Scenario-Aesthetic Feature’ category.

Appendix B LVLMs’ pAccpAcc on different categories of visual perceptions

Method | Behaviour (B1, B2, B3, B4) | Role (R1, R2, R3) | Identity (I1, I2) | Emotion (E1, E2) | Scenario (S1, S2)
MiniCPM-2 (3B) 86.36 55.36 42.22 56.10 75.68 70.18 65.00 57.69 45.45 31.58 62.75 84.62 75.00
DeepSeek (1.3B) 81.82 58.93 46.67 51.22 67.57 68.42 60.00 46.15 31.82 31.58 58.82 69.23 64.29
DeepSeek (7B) 86.36 53.57 44.44 63.41 75.68 73.68 60.00 57.69 45.45 28.95 58.82 92.31 75.00
MiniGPT4 (8.2B) 13.64 19.64 17.78 4.88 18.92 19.30 15.00 7.69 9.09 15.79 13.73 7.69 14.29
MiniGPT-v2 (8.2B) 68.18 53.57 44.44 29.27 62.16 59.65 60.00 34.62 36.36 26.32 23.53 53.85 50.00
InstructBLIP (7B) 74.24 57.14 28.89 36.59 51.35 47.37 70.00 34.62 50.00 15.79 25.49 76.92 35.71
InstructBLIP (13B) 69.70 41.07 31.11 31.71 32.43 59.65 60.00 42.31 40.91 23.68 33.33 61.54 35.71
LLaVA-1.5 (7B) 80.30 58.93 53.33 60.98 70.27 68.42 50.00 50.00 22.73 47.37 60.78 92.31 57.14
LLaVA-1.5 (13B) 92.42 50.00 51.11 51.22 62.16 71.93 70.00 46.15 50.00 36.84 50.98 69.23 60.71
GPT-4V 74.24 51.79 40.00 56.10 75.68 66.66 60.00 42.31 4.55 52.63 41.18 76.92 71.43
GPT-4o 81.82 66.07 51.11 60.98 72.97 71.93 80.00 65.38 34.78 59.46 51.92 83.33 92.86
Table 4: Models’ performance on different categories of visual perceptions. The denotions of different categories are consistent with the definition in Figure 3 (a). We highlight the models with highest performance on each metric.