SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Abstract
Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple-choice questions with accurate human annotations (6× larger than existing benchmarks), spanning 12 evaluation dimensions that cover the comprehension of both the image and video modalities. We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes. Multiple-choice questions with groundtruth options derived from human annotation enable an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 18 models across all 12 dimensions, covering both spatial and temporal understanding. By revealing the limitations of existing MLLMs through the evaluation results, we aim for SEED-Bench to provide insights that motivate future research. We will launch and consistently maintain a leaderboard to provide a platform for the community to assess and investigate model capability.
1 Introduction


In recent years, Large Language Models (LLMs) [1, 2, 3, 4, 5] have exhibited remarkable capabilities to understand, reason, and generate text across a variety of open-ended tasks. Leveraging the strong generality of LLMs, generative Multimodal Large Language Models (MLLMs) [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21] have demonstrated enhanced abilities for multimodal comprehension and generation. However, current MLLMs are mainly evaluated with a limited number of qualitative examples, or with previous benchmarks that are not tailored for evaluating MLLMs with open-form output. For example, in VQAv2 [22], an answer is considered correct only if the model’s output exactly matches the groundtruth answer, which typically consists of just one or two words. The lack of a comprehensive and objective benchmark for evaluating MLLMs poses a significant challenge for comparing and investigating the performance of various models.
Concurrent works [23, 24, 25, 26] have made efforts to develop benchmarks specifically for evaluating MLLMs, as shown in Table 1. For example, LVLM-eHub [25] and LAMM [24] utilize existing public datasets across various computer vision tasks as evaluation samples, and employ human annotators or GPT to assess the quality, relevance, and usefulness of models’ predictions. However, the involvement of humans and GPT during evaluation not only compromises efficiency, but also leads to increased subjectivity and reduced accuracy of the assessment. MME [23] and MMBench [26] further advance the objective evaluation of MLLMs by constructing True/False questions or multiple-choice questions covering a variety of ability dimensions. Restricting the model’s output to True/False or A/B/C/D options facilitates the convenient computation of accuracy, which serves as an objective metric for evaluation. However, the relatively small scale of these benchmarks (fewer than 3K samples) introduces instability into the evaluation statistics.
In this work, we focus on evaluating the generative comprehension capability of MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench.* SEED-Bench spans 12 evaluation dimensions across both image and video modalities as shown in Fig. 1. SEED-Bench consists of 19K multiple-choice questions with groundtruth answers derived from human annotation (9× larger than MME and 6× larger than MMBench) as shown in Fig. 2. We design a sophisticated pipeline for the generation of multiple-choice questions that are tailored to evaluate specific dimensions. We further incorporate an automated filtering mechanism and a manual verification process to ensure the quality of the questions and the accuracy of the groundtruth answers.

*In pursuit of Artificial General Intelligence (AGI), LLMs have witnessed substantial progress. We make the bold assumption that the premise for the emergence of multimodal capabilities is to unify both comprehension and generation within an autoregressive generative model, towards which SEED [18] takes a modest step. Besides the exploration of models, it is essential to have appropriate evaluations that motivate research directions. Therefore, we concurrently propose SEED-Bench to evaluate the comprehension ability of generative models.
Specifically, for images, we utilize various foundation models to extract their visual information, including image-level captions [6, 27], instance-level descriptions [28, 29, 30] and textual elements [31]. For videos, we leverage the original human annotations to provide visual information. We then feed the visual information to ChatGPT/GPT-4 with specially designed prompts corresponding to specific evaluation dimensions. ChatGPT/GPT-4 subsequently generates questions as well as four candidate options with one groundtruth answer. We further filter out questions that can be answered without the visual input by utilizing multiple LLMs. Finally, we employ human annotators to choose the correct option of each multiple-choice question and classify each question into one evaluation dimension, resulting in a clean and high-quality benchmark containing 19K multiple-choice questions. Our pipeline supports the scalability of evaluation data across multiple domains, and we will continue to expand the benchmark with more evaluation dimensions.

Benchmark | Visual Modality | Customized Questions | #Answer Annotations | Answer Type | Human/GPT Evaluation | #Models |
MME [23] | Image | ✓ | 2194 | Y/N | N/A | 10 |
LAMM [24] | Image & Point cloud | ✗ | - | free-form | GPT | 4 |
LVLM-eHub [25] | Image | ✗ | - | free-form | Human | 8 |
MMBench [26] | Image | ✓ | 2974 | free-form | GPT | 14 |
Ours | Image & Video | ✓ | 19242 | A/B/C/D | N/A | 18 |
Based on SEED-Bench, we comprehensively evaluate 18 models, including LLMs, ImageLLMs and VideoLLMs, across all 12 dimensions as shown in Fig. 1. Different from MMBench [26], which employs ChatGPT to match a model’s prediction to one of the choices in a multiple-choice question (achieving only an 87.0% alignment rate), we follow GPT-3 [32] and calculate the log-likelihood of each candidate option, selecting the one with the highest value as the final prediction, without relying on the instruction-following capability of models to output “A”, “B”, “C” or “D”. By analyzing the results across 12 dimensions, we conduct a comprehensive comparison of existing multimodal models in both spatial and temporal understanding capabilities. We observe that the majority of MLLMs still exhibit limited performance across all 12 evaluation dimensions, and surprisingly find that VideoLLMs fail to achieve competitive performance on temporal understanding compared with ImageLLMs. Through the evaluation results, we aim for SEED-Bench to provide insights for motivating future exploration of more advanced MLLMs. We will launch an evaluation platform and consistently maintain a leaderboard for assessing and comparing model performance.
2 Related Work
Multimodal Large Language Models. With the impressive success of Large Language Models (LLMs) [1, 5, 4], recent studies work on generative Multimodal Large Language Models (MLLMs) [6, 7, 8, 9, 10, 11, 12, 13, 14, 18, 19, 20, 21] to improve multimodal comprehension and generation by utilizing the strong generality of LLMs. Some works [15, 16, 17] further consider video inputs and leverage the vast capabilities of LLMs for video understanding tasks. In SEED-Bench, we provide a comprehensive quantitative evaluation of these models to thoroughly assess and compare their performance in generative comprehension.
Benchmarks for Multimodal Large Language Models. With the rapid development of Multimodal Large Language Models (MLLMs), some concurrent works [23, 24, 25, 26] propose various benchmarks for evaluating MLLMs. For example, GVT [33] constructs a benchmark by aggregating two semantic-level understanding tasks (VQA and Image Captioning) and two fine-grained tasks (Object Counting and Multi-class Identification), but its evaluation is constrained to limited aspects of visual understanding. LVLM-eHub [25] combines multiple existing computer vision benchmarks and develops an online platform, where two models are prompted to answer a question related to an image and human annotators are employed to compare the predictions of the models. The involvement of human annotators during evaluation not only introduces bias but also incurs significant costs. LAMM [24] evaluates image and point cloud tasks by using entity extraction to obtain key answers from open-form predictions and utilizing GPT to evaluate the answers’ relevance and accuracy to the groundtruth. The reliance on entity extraction and the GPT metric can impact the accuracy and reliability of the evaluation. MME [23] and MMBench [26] aim to enhance the objective evaluation of MLLMs by constructing 2914 True/False questions and 2974 multiple-choice questions across a variety of ability dimensions, respectively. Considering the relatively small scale of these benchmarks, their evaluation results may exhibit instability. In this work, we introduce SEED-Bench to provide an objective and comprehensive evaluation of MLLMs, which contains 19K multiple-choice questions covering 12 evaluation dimensions including both spatial and temporal understanding.
3 SEED-Bench
Our benchmark contains 19K multiple-choice questions with accurate human annotations spanning 12 evaluation dimensions, including both spatial and temporal understanding. In this section, we first present the evaluation dimensions of SEED-Bench in Sec. 3.1. We introduce the data source in Sec. 3.2 and our pipeline for constructing multiple-choice questions in Sec. 3.3. We finally describe the evaluation strategy for MLLMs to answer multiple-choice questions in Sec. 3.4.
Evaluation Dimensions | Sample Questions | |
Spatial Understanding | 1. Scene Understanding | What is the weather like in the image? A. It’s a sunny day B. It’s foggy C. It’s raining heavily D. It’s a cloudy day |
2. Instance Identity | What kind of animal is visible in the image? A. Horse B. Cow C. Sheep D. Goat | |
3. Instance Attribute | What is the material of the table? A. Marble B. Wood C. Glass D. Plastic | |
4. Instance Location | Where is the dog located in the living room? A. On the fireplace B. On the table C. On the chair D. On the rug | |
5. Instance Counting | How many people are there in the image? A. 1 B. 2 C. 4 D. 3 | |
6. Spatial Relation | What is the tree in relation to the house? A. In front of the house B. Behind the house C. Inside the house D. Left to the house | |
7. Instance Interaction | What is the relation between a player and a referee? A. The player is shaking hands with a referee B. The player is arguing with a referee C. The player is receiving an award from a referee D. The player is shown a card by a referee | |
8. Visual Reasoning | What can we infer about the situation? A. They are admiring the engine B. They are experiencing car trouble C. They are having a picnic D. They are washing the car | |
9. Text Recognition | What is the main warning on the sign? A. Do not enter B. Dead end road C. Beware of bears D. Trail closed | |
Temporal Understanding | 10. Action Recognition | What is the action being carried out in the video? A. Throwing something in the air and letting it fall B. Throwing something in the air and catching it C. Lifting up one end of something, then letting it drop down D. Poking something so that it falls over |
11. Action Prediction | What action do you anticipate following the end of this video? A. Stir potatoes B. Wash potatoes C. Add potatoes D. Slice potatoes | |
12. Procedure Understanding | Can you recognize the actions in this video and list them in order? A. Cook breakfast, switch stove on, close fridge, carry milk, peel banana B. Scoop ice cream, squeeze chocolate syrup, pour sprinkles, close fridge C. Close fridge, carry milk, screw open milk cap, pour milk, screw close milk cap D. Reach for cereal box, grab bowl, pour milk, stir cereal, close fridge |
3.1 Evaluation Dimensions
In order to comprehensively assess the visual understanding capability of MLLMs, SEED-Bench incorporates 12 evaluation dimensions covering both spatial and temporal comprehension, as shown in Table 2.
Spatial Understanding. For the evaluation of spatial comprehension, we consider 9 dimensions covering image-level and instance-level perception and reasoning.
• Scene Understanding. This dimension focuses on the global information in the image. Questions can be answered through a holistic understanding of the image.
• Instance Identity. This dimension involves the identification of a certain instance in the image, including the existence or category of a certain object. It evaluates a model’s object recognition capability.
• Instance Attributes. This dimension is related to the attributes of an instance, such as color, shape or material. It assesses a model’s understanding of an object’s visual appearance.
• Instance Location. This dimension concerns the absolute position of one specified instance. It requires a model to correctly localize the object referred to in the question.
• Instance Counting. This dimension requires the model to count the number of a specific object in the image. This requires the model to understand all objects and successfully count the referred object’s instances.
• Spatial Relation. This dimension asks a model to ground the two mentioned objects and recognize their relative spatial relation within the image.
• Instance Interaction. This dimension requires the model to recognize the state relation or interaction between two humans or objects.
• Visual Reasoning. This dimension evaluates whether a model is able to reason based on the visual information. It requires the model to fully understand the image and utilize commonsense knowledge to correctly answer the questions.
• Text Recognition. For this dimension, the model should answer questions about the textual elements in the image.
Temporal Understanding. For the evaluation of temporal comprehension, we consider 3 dimensions focusing on the recognition, prediction and procedure understanding of actions.
• Action Recognition. In this dimension, the model is required to recognize the action shown in the video. Not only the ability to capture temporal dynamics, but also knowledge of physical motions, human actions and dynamic interactions between objects is evaluated.
• Action Prediction. The target of this dimension is to predict the future action from the preceding video segment, which requires understanding of the contextual information in videos and temporal reasoning.
• Procedure Understanding. This dimension requires the model to capture all the key actions and perform temporal ordering on them. We aim to evaluate the ability of temporally fine-grained understanding and procedure reasoning.

3.2 Data Source
To create a benchmark with various evaluation dimensions, we need to collect data containing images with abundant visual information and videos with rich temporal dynamics, so that we can construct diverse and challenging multiple-choice questions. In SEED-Bench, we use the CC3M [34] dataset with filtered samples to build questions for spatial understanding. Specifically, considering the noisy original captions of CC3M, we generate captions for each image with Tag2Text [27]. We filter out images whose captions contain no more than 5 nouns, so as to ensure the information richness of the remaining images for constructing questions.
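The noun-based filtering step can be sketched as follows. This is an illustrative sketch rather than the exact implementation: it assumes the Tag2Text captions are already available as plain strings, and it uses spaCy's part-of-speech tagger, which is our own choice for counting nouns.

```python
# Illustrative sketch of the caption-based filtering of CC3M images:
# keep only images whose generated captions contain more than five nouns,
# as a proxy for the visual richness needed to construct questions.
import spacy

nlp = spacy.load("en_core_web_sm")  # lightweight English POS tagger

def count_nouns(caption: str) -> int:
    """Count common and proper noun tokens in a caption."""
    return sum(token.pos_ in ("NOUN", "PROPN") for token in nlp(caption))

def filter_images(image_captions: dict) -> list:
    """Return ids of images whose captions contain more than five nouns."""
    return [img_id for img_id, caption in image_captions.items()
            if count_nouns(caption) > 5]

# Hypothetical Tag2Text captions keyed by image id.
captions = {
    "img_001": "a man in a red jacket rides a bicycle past a bakery on a rainy street corner",
    "img_002": "a close-up photo of grass",
}
print(filter_images(captions))  # expected: ["img_001"]
```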
We further adopt the Something-Something-v2 (SSV2) [35], Epic-Kitchens 100 [36] and Breakfast [37] datasets to build questions for temporal understanding. SSV2 is an action recognition dataset including 174 fine-grained categories of basic actions with everyday objects, and we adopt 1740 videos from its validation set. We also select 138 long videos from the Epic-Kitchens 100 dataset with temporally annotated action labels. Moreover, videos and fine-grained action segmentation annotations from the Breakfast dataset [37] are utilized for the procedure understanding task.
3.3 Multiple-Choice Questions
As shown in Fig. 3, our pipeline for generating multiple-choice questions involves question/answer generation and verification. For generating question/answer pairs, we first leverage various foundation models to extract visual information including image-level captions, instance-level descriptions and textual elements. Based on specially designed prompts corresponding to specific evaluation dimensions, ChatGPT/GPT-4 subsequently generates questions and four candidate options with one groundtruth answer. For verifying question/answer pairs, we filter out questions that can be answered correctly by multiple LLMs without resorting to visual information. We further employ human annotators to select the correct option and classify each question into one evaluation dimension.
Visual Information Extraction. For constructing questions related to spatial understanding, we interpret the rich information in each image with texts using multiple pretrained models, so that ChatGPT/GPT-4 can understand the image and create questions accordingly. For constructing questions related to temporal understanding, considering that extracting reliable temporal information from videos (especially fine-grained actions and long-term temporal context) is extremely difficult given existing foundation models, we utilize the ground-truth annotations of video datasets. We will explore how to generate questions based on automatically extracted video information in the future. The extraction of visual information for images includes the following parts:
• Image Captions. Image captions contain the overall description of an image. We employ BLIP2 [38] and Tag2Text [27] to create captions for each image. The former creates captions for the whole image, while the latter generates captions based on descriptions of each instance. The two models complement each other to depict the image content within a single sentence.
• Instance Descriptions. Besides captions, which may ignore specific details in the image, we also extract visual information from images using instance-level descriptions, including object detection, attribute detection, and dense captions. Specifically, we use SAM [29] to segment each instance in the image and obtain their bounding boxes according to the segmentation results. The object labels are obtained using Tag2Text [27]. Besides, we also utilize an attribute detector [30] to obtain the attributes of each instance in the image. Finally, we employ GRiT [28] to generate dense captions, which describe each detected instance in the image with a short sentence. These instance-level descriptions are complementary to the image captions, further enriching the visual information of each image.
• Textual Elements. Besides objects, the texts in the image also contain important information describing the image. We employ PaddleOCR [31] for detecting textual elements.
Question-Answer Generation. After extracting visual information from the image or video, we task ChatGPT/GPT-4 with generating multiple-choice questions based on the extracted information or video annotations. For each spatial understanding evaluation dimension, we carefully design prompts and ask ChatGPT/GPT-4 to create multiple-choice questions with four candidate options based on the extracted visual information. We create questions with ChatGPT for all evaluation dimensions, except for the reasoning dimension, where we use GPT-4 [2] due to its exceptional reasoning capability. For each question, we ask ChatGPT/GPT-4 to create four choices with one correct option and three distractors. We try to make the multiple-choice questions challenging by encouraging the three wrong choices to be similar to the correct one. The detailed prompts for generating multiple-choice questions for different evaluation dimensions are listed in Fig. 4. For generating questions related to temporal understanding, we utilize the ground-truth annotations of the selected videos as the answers of the multiple-choice questions and employ ChatGPT to generate three distractors.
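To make this generation step concrete, below is a minimal sketch of how the extracted visual information could be assembled into a prompt and sent to ChatGPT. The prompt wording, the JSON output format and the use of the `openai` Python client are illustrative assumptions; the actual prompts are those listed in Fig. 4.

```python
# Illustrative sketch of question-answer generation (not the exact prompts in Fig. 4):
# pack the extracted visual information into a text context and ask the chat model
# for one multiple-choice question with four options targeting a given dimension.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def build_context(captions, instance_descriptions, ocr_texts):
    """Assemble extracted visual information into a plain-text context."""
    return "\n".join([
        "Image captions: " + "; ".join(captions),
        "Instance descriptions: " + "; ".join(instance_descriptions),
        "Texts detected in the image: " + "; ".join(ocr_texts or ["none"]),
    ])

def generate_mcq(context: str, dimension: str) -> str:
    """Ask the chat model for a four-option question targeting one evaluation dimension."""
    prompt = (
        f"Based only on the visual information below, write one multiple-choice question "
        f"that evaluates '{dimension}'. Provide four options with exactly one correct answer "
        f"and three hard distractors that are similar to it. "
        f"Return JSON with keys: question, choices, answer.\n\n{context}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Example usage with hypothetical extracted information.
context = build_context(
    captions=["a dog lying on a rug in a living room"],
    instance_descriptions=["dog: small, brown; box [120, 340, 260, 460]", "rug: red, patterned"],
    ocr_texts=[],
)
print(generate_mcq(context, "instance location"))
```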

Automatic Filtering. Our benchmark aims at evaluating the multimodal vision-language understanding capability of MLLMs. However, we observe that some generated questions can be correctly answered by LLMs without seeing the image. We argue that such questions are not helpful for evaluating the visual comprehension capability of MLLMs. To this end, we feed the generated questions (without the image) into three powerful LLMs, namely Vicuna-7B [4], Flan-T5-XXL [1] and LLaMA-7B [5], and ask them to answer the questions. We empirically find that a portion of the generated questions can be correctly answered by all three LLMs, and we filter these questions out of our benchmark.
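The filtering logic itself is simple and can be sketched as follows; `answer_with_llm` is a hypothetical stand-in for however each text-only LLM selects an option (e.g., the likelihood-based ranking described in Sec. 3.4), and the model list mirrors the three LLMs named above.

```python
# Sketch of the automatic filtering step: discard questions that every text-only
# LLM answers correctly without seeing the image or video.
from typing import Callable

LLM_JUDGES = ["vicuna-7b", "flan-t5-xxl", "llama-7b"]

def filter_visually_grounded(
    questions: list,
    answer_with_llm: Callable[[str, str, list], str],
) -> list:
    """Keep only questions that at least one text-only LLM fails to answer."""
    kept = []
    for q in questions:
        # A question is text-answerable if every judge picks the correct option
        # from the question text alone (no visual input).
        text_answerable = all(
            answer_with_llm(model, q["question"], q["choices"]) == q["answer"]
            for model in LLM_JUDGES
        )
        if not text_answerable:
            kept.append(q)
    return kept
```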
Human Annotation. To ensure the accuracy and objectivity of SEED-Bench, we further employ human annotators to verify the generated question/answer pairs. Human annotators are asked to choose the correct answer for each multiple-choice question and categorize each question into one of the evaluation dimensions. If a question cannot be answered based on the visual input, has no correct choice, or has multiple correct choices, it is discarded by the human annotators. This results in a clean, high-quality and well-categorized benchmark for evaluation with a total of 19K multiple-choice questions. The statistics of the number of multiple-choice questions in each evaluation dimension are shown in Fig. 1. We observe a minimum number of questions in text recognition with 85 samples, and a maximum number in instance localization with 4649 samples. We will work towards a more even distribution of multiple-choice questions across the different evaluation dimensions in the future.
3.4 Evaluation Strategy
Different from MMBench [26], which employs ChatGPT to match a model’s prediction to one of the choices in a multiple-choice question (achieving only an 87.0% alignment rate), we adopt the answer ranking strategy [10, 32, 39] for evaluating existing MLLMs on multiple-choice questions. Specifically, for each choice of a question, we compute the likelihood that an MLLM generates the content of this choice given the question. We select the choice with the highest likelihood as the model’s prediction. Our evaluation strategy does not rely on the instruction-following capability of models to output “A”, “B”, “C” or “D”. Furthermore, this evaluation strategy eliminates the impact of the order of multiple-choice options on the model’s performance.
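A minimal sketch of this answer-ranking strategy is given below using a Hugging Face causal language model. It is text-only and uses a placeholder model for illustration; in practice the MLLM also conditions on the image or video, and the prompt template here is an assumption rather than the exact one used in our evaluation.

```python
# Minimal sketch of the answer-ranking evaluation strategy: score each choice by
# the log-likelihood of its tokens given the question, and predict the choice
# with the highest score (no reliance on the model outputting "A"/"B"/"C"/"D").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; an MLLM's language backbone would be used here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def choice_loglikelihood(question: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens conditioned on the question."""
    prompt_ids = tokenizer(f"Question: {question}\nAnswer:", return_tensors="pt").input_ids
    choice_ids = tokenizer(" " + choice, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, choice_ids], dim=1)
    logits = model(input_ids).logits                       # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..seq_len-1
    targets = input_ids[0, 1:]
    token_logps = log_probs[torch.arange(targets.numel()), targets]
    return token_logps[-choice_ids.shape[1]:].sum().item()  # keep only the choice tokens

def predict(question: str, choices: list) -> str:
    """Return the choice with the highest log-likelihood under the model."""
    return max(choices, key=lambda c: choice_loglikelihood(question, c))

print(predict("What is the weather like in the image?",
              ["It's a sunny day", "It's foggy", "It's raining heavily", "It's a cloudy day"]))
```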
4 Evaluation Results
Model Type | Model | Language Model | Spatial Acc | Spatial Rank | Temporal Acc | Temporal Rank | Overall Acc | Overall Rank |
LLM | Flan-T5 [1] | Flan-T5-XL | 27.32 | 17 | 28.56 | 11 | 27.65 | 17 |
Vicuna [4] | Vicuna-7B | 28.16 | 16 | 29.46 | 8 | 28.50 | 16 | |
LLaMA [5] | LLaMA-7B | 26.56 | 18 | 27.27 | 13 | 26.75 | 18 | |
ImageLLM | BLIP-2 [6] | Flan-T5-XL | 49.74 | 3 | 36.71 | 3 | 46.35 | 3 |
InstructBLIP [10] | Flan-T5-XL | 57.80 | 2 | 38.31 | 1 | 52.73 | 2 | |
InstructBLIP Vicuna [10] | Vicuna-7B | 58.76 | 1 | 38.05 | 2 | 53.37 | 1 | |
LLaVA [8] | LLaMA-7B | 36.96 | 8 | 23.75 | 16 | 33.52 | 9 | |
MiniGPT-4 [7] | Flan-T5-XL | 47.40 | 4 | 29.89 | 7 | 42.84 | 4 | |
VPGTrans [40] | LLaMA-7B | 41.81 | 5 | 31.40 | 5 | 39.10 | 5 | |
MultiModal-GPT [12] | LLaMA-7B | 34.54 | 12 | 29.21 | 10 | 33.15 | 11 | |
Otter [11] | LLaMA-7B | 35.16 | 11 | 30.35 | 6 | 33.91 | 8 | |
OpenFlamingo [41] | LLaMA-7B | 34.51 | 13 | 29.25 | 9 | 33.14 | 12 | |
LLaMA-Adapter V2 [42] | LLaMA-7B | 35.19 | 10 | 25.75 | 14 | 32.73 | 13 | |
GVT [33] | Vicuna-7B | 35.49 | 9 | 27.77 | 12 | 33.48 | 10 | |
mPLUG-Owl [9] | LLaMA-7B | 37.88 | 7 | 23.02 | 18 | 34.01 | 7 | |
VideoLLM | VideoChat [15] | Vicuna-7B | 39.02 | 6 | 33.68 | 4 | 37.63 | 6 |
Video-ChatGPT [16] | LLaMA-7B | 33.88 | 14 | 23.46 | 17 | 31.17 | 14 | |
Valley [17] | LLaMA-13B | 32.04 | 15 | 25.41 | 15 | 30.32 | 15 |
4.1 Models
Based on our SEED-Bench, we evaluate 18 models, including 3 LLMs, i.e., Flan-T5 [1], Vicuna [4] and LLaMA [5]; 12 ImageLLMs, i.e., OpenFlamingo [41], BLIP-2 [6], MiniGPT-4 [7], LLaVA [8], mPLUG-Owl [9], InstructBLIP [10], Otter [11], MultiModal-GPT [12], GVT [33], PandaGPT [13], VPGTrans [40] and LLaMA-Adapter V2 [42]; and 3 VideoLLMs, i.e., VideoChat [15], Video-ChatGPT [16] and Valley [17]. Each model is evaluated on all 12 dimensions, covering both spatial and temporal understanding. For ImageLLMs, besides the evaluation of spatial understanding, we aim to investigate their capability to perform temporal reasoning across multiple frames. For VideoLLMs, we seek to explore whether their spatial understanding abilities degrade when taking a single image as the input.

4.2 Results
The evaluation results of different models on SEED-Bench are listed in Table 3, where accuracy refers to the proportion of correctly answered multiple-choice questions relative to the total number of questions. We are surprised to observe that InstructBLIP [10] not only achieves the best performance based on the averaged results across the nine dimensions for evaluating spatial understanding, but also surpasses VideoLLMs in terms of the averaged results across the three dimensions for evaluating temporal understanding. We display leaderboards for the various evaluation dimensions of SEED-Bench in Fig. 5 to provide a comprehensive assessment of the different models. The overall leaderboard based on the averaged results across all evaluation dimensions is shown in Fig. 1. To better showcase the capabilities of models across different evaluation dimensions, we further visualize the ranking of each model within each evaluation dimension in Fig. 6, where darker colors represent higher ranks. We can observe that the BLIP series [6, 10] models achieve competitive results on multiple evaluation dimensions, but they are not good at visual reasoning and action recognition. The VideoLLM Valley [17] achieves suboptimal performance in the majority of evaluation dimensions. LLaVA [8] exhibits unparalleled capability in text recognition compared to the other evaluation dimensions. In terms of specific evaluation dimensions, MiniGPT-4 [7] and mPLUG-Owl [9] perform better in visual reasoning, while VPGTrans [40] excels in action recognition and procedure understanding. LLaMA-Adapter V2 [42] shows more proficiency in action recognition. What’s more, MultiModal-GPT [12], Otter [11], OpenFlamingo [41], GVT [33], and the three VideoLLMs [15, 16, 17] exhibit balanced strength across various evaluation dimensions.
4.3 Analysis
Through the comprehensive and objective evaluation of various models on SEED-Bench, we have observed a number of findings that can bring insights for future work.

Most MLLMs still exhibit limited performance across all 12 evaluation dimensions. As shown in Fig. 1 and Fig. 5, most MLLMs (except the BLIP series models) cannot reach 50% accuracy either in average performance or on more than three individual evaluation dimensions. In some specific evaluation dimensions (e.g., visual reasoning), most MLLMs appear to achieve high accuracy. However, when comparing the performance of MLLMs to that of LLMs, we observe that the performance improvement of most MLLMs is still relatively limited.
MLLMs achieve relatively high performance on global image comprehension. On the evaluation of scene understanding and visual reasoning, the accuracy of most MLLMs is higher than 40%, and all MLLMs outperform LLMs. This shows that MLLMs are more proficient in the global understanding and reasoning of images than in the other evaluation dimensions, which require fine-grained instance-level comprehension.
InstructBLIP achieves top performance on 8 of the 12 evaluation dimensions. We observe that InstructBLIP outperforms other models on 8 evaluation dimensions, and the possible explanations for this superior performance are as follows. (a) The instruction-tuning data of InstructBLIP contains a total of 16M samples (larger than other instruction-tuning datasets) and covers a wide range of multimodal tasks, even including QA data for OCR and temporal visual reasoning. (b) The weights of the LLM are frozen when performing instruction tuning of InstructBLIP, which may alleviate catastrophic forgetting. However, the InstructBLIP series models still perform poorly on action recognition and procedure understanding, which differ significantly from the instruction-tuning data. For instance, on action recognition, which requires the understanding of fine-grained actions in Something-Something-v2, the InstructBLIP series models achieve no significant performance gain compared to LLMs (i.e., lower than 2%). This indicates that the InstructBLIP series models may fail to generalize well to out-of-distribution data.
MLLMs show weaker abilities in understanding spatial relationships between objects. The top-ranked model InstructBLIP only achieves 40% accuracy on the evaluation of spatial relations, which shows that recognizing relative spatial relationships between instances is challenging: there can be many possible arrangements and combinations of spatial relationships between instances, and in some cases the spatial relationship between objects is ambiguous, making it difficult to determine.
Most MLLMs show poor performance on text recognition. Apart from InstructBLIP, all other models achieve an accuracy lower than 40% on text recognition due to the lack of textual elements in multimodal pre-training datasets. Since the ability to accurately identify and extract text from images is important, future work should develop models that are better equipped to handle text recognition by pre-training on datasets with rich textual elements in the visual data.
VideoLLMs achieve promising results on spatial understanding. For example, VideoChat achieves 39.98% accuracy on instance location (ranking 4th), surpassing LLaVA by 11.55% and performing only 3.58% lower than the top-1 model. This shows that VideoChat’s spatial understanding ability does not degrade despite jointly training on both image and video data during the pre-training and instruction-tuning stages.
Most MLLMs exhibit unsatisfactory performance on fine-grained temporal understanding. It is notable that on the evaluation of procedure understanding, the top-ranked model, VPGTrans, achieves an accuracy that is only 5% higher than that of LLaMA. The performance improvement of the next 4 MLLMs is even less than 1.2% compared with LLaMA. This demonstrates that it is extremely difficult for both ImageLLMs and VideoLLMs to perform the fine-grained temporal reasoning required to recognize and order the key actions in a video.
VideoLLMs fail to achieve competitive performance on temporal understanding. Although VideoLLMs are instruction-tuned on video data, they do not exhibit a significant advantage on the evaluation dimensions for temporal understanding. Surprisingly, two VideoLLMs (Video-ChatGPT and Valley) even perform worse than most ImageLLMs on action recognition, action prediction and procedure understanding. This indicates that the capabilities of existing VideoLLMs for fine-grained action recognition, temporal relationship understanding and temporal reasoning are still limited. Similar concerns about existing VideoLLMs are also raised in recent works [15, 16].
5 Conclusion
In this work, we propose SEED-Bench, a large-scale benchmark that provides a comprehensive and objective evaluation of Multimodal Large Language Models (MLLMs) on generative comprehension. SEED-Bench consists of 19K multiple-choice questions with accurate human annotations, covering 12 evaluation dimensions for both spatial and temporal understanding. We design an advanced pipeline to create multiple-choice questions that target specific evaluation dimensions, facilitating the scalability of evaluation data across a variety of domains. We also integrate automatic filtering and manual verification to improve the quality of the generated questions and answers. We conduct a thorough evaluation of 18 models, analyzing and comparing their performance to provide insights for future research. We plan to launch and consistently maintain a leaderboard, offering a platform for the community to assess model performance. We will continue to broaden the evaluation dimensions of SEED-Bench with more data.
Acknowledgements
We sincerely acknowledge Junting Pan (CUHK MMLab) for the insightful suggestions, Zhan Tong (Nanjing University) for the data processing, and Yi Chen (Tencent AI Lab) for the engaging discussions.
References
- [1] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
- [2] OpenAI. Gpt-4 technical report, 2023.
- [3] OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022.
- [4] FastChat. Vicuna. https://github.com/lm-sys/FastChat, 2023.
- [5] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [6] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML, 2023.
- [7] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- [8] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
- [9] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
- [10] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
- [11] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.
- [12] Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023.
- [13] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023.
- [14] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
- [15] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
- [16] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
- [17] Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207, 2023.
- [18] Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041, 2023.
- [19] Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.
- [20] Lili Yu, Bowen Shi, Ram Pasunuru, Benjamin Miller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russ Howes, Vasu Sharma, Jacob Xu, Uriel Singer, Daniel Li, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz, Luke Zettlemoyer, and Armen Aghajanyan. Scaling autoregressive multi-modal models: Pretraining and instruction tuning. 2023.
- [21] Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov. Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023.
- [22] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
- [23] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
- [24] Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, et al. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. arXiv preprint arXiv:2306.06687, 2023.
- [25] Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023.
- [26] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
- [27] Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657, 2023.
- [28] Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022.
- [29] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023.
- [30] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In CVPR, 2021.
- [31] PaddleOCR. https://github.com/PaddlePaddle/PaddleOCR.
- [32] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- [33] Guangzhi Wang, Yixiao Ge, Xiaohan Ding, Mohan Kankanhalli, and Ying Shan. What makes for good visual tokenizers for large language models? arXiv preprint arXiv:2305.12223, 2023.
- [34] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
- [35] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.
- [36] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision. arXiv preprint arXiv:2006.13256, 2020.
- [37] Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In CVPR, 2014.
- [38] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
- [39] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
- [40] Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, and Tat-Seng Chua. Transfer visual prompt generator across llms. arXiv preprint arXiv:2305.01278, 2023.
- [41] ML Foundations. OpenFlamingo. https://github.com/mlfoundations/open_flamingo, 2023.
- [42] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.