
LEO: Boosting Mixture of Vision Encoders for
Multimodal Large Language Models

Mozhgan Nasr Azadani  James Riddell  Sean Sedwards  Krzysztof Czarnecki
University of Waterloo
https://github.com/Mozhgan91/LEO
Corresponding author: [email protected]
Abstract

Enhanced visual understanding serves as a cornerstone for multimodal large language models (MLLMs). Recent hybrid MLLMs incorporate a mixture of vision experts to address the limitations of using a single vision encoder and excessively long visual tokens. Despite the progress of these MLLMs, a research gap remains in effectively integrating diverse vision encoders. This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO, a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling: for each segmented tile of the input images, LEO sequentially interleaves the visual tokens from its two vision encoders. Extensive evaluation across 13 vision-language benchmarks reveals that LEO outperforms state-of-the-art open-source MLLMs and hybrid MLLMs on the majority of tasks. Furthermore, we show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe, achieving competitive performance compared to existing baselines. The code and model will be publicly available.

1 Introduction

Advancements in multimodal large language models (MLLMs) [21, 2, 12, 29] have harnessed the strengths of pre-trained large language models (LLMs) alongside powerful vision encoders. These models are trained through multiple stages on large-scale image-text datasets, which effectively aligns visual tokens extracted from vision foundation models, such as CLIP [38], with the latent space of LLMs. This multi-stage alignment has enabled advancements in tasks involving vision-language comprehension and reasoning [28, 2, 10]. Nonetheless, due to limitations in input resolution, arising from the constraints of current vision encoders and language model sequence lengths, their effectiveness is reduced in tasks requiring high visual detail, such as complex optical character recognition (OCR) and chart understanding. Enhancing visual understanding is essential for minimizing hallucinations [39] and improving performance in tasks that require high-resolution inputs.

Figure 1: Comparison of the performance of LEO across diverse vision-language tasks with recent approaches [7, 3, 27, 11, 33].
Figure 2: Top: Comparison between the fusion strategy of existing hybrid MLLMs and that of LEO. Bottom: The most common fusion paradigms in the literature: (1) channel concatenation [40], (2) sequence concatenation [16], (3) MR-adapter [33], and (4) cross-attention [23].

To this end, recent studies [25, 28, 23, 16, 11, 19] have attempted to improve MLLMs by exploring methods that enhance their visual understanding capabilities. Some works [50, 7, 28, 24, 6] demonstrate that strengthening vision encoders, either by scaling up model parameters and pretraining data to match those of LLMs or by employing tile segmentation, can significantly improve detailed scene understanding. However, these approaches often come with increased computational burden, particularly when multiple images are processed. This has prompted the development of a new class of models that exploit multiple vision encoders, each pretrained for distinct vision tasks and input resolutions, and integrated using a variety of fusion techniques, including but not limited to sequence concatenation [16, 26, 11], channel concatenation [40, 31], mixture-of-resolution adaptation [33], and cross-attention [23], to enhance their ability to process complex multimodal data. Despite their success, these prior studies have primarily focused on pre-adaptation fusion, leaving tile-level post-adaptation fusion of visual tokens from diverse vision encoders unexplored.

In this work, we introduce LEO, a novel MLLM designed to enhance the integration of multiple vision encoders for multimodal language tasks. Unlike existing hybrid models, LEO adopts a unique post-adaptation fusion strategy combined with tile segmentation (see Fig. 2, top). Specifically, each input image is divided into 448×448 tiles, which are processed independently by our dual-branch vision encoder architecture. This allows each encoder to exploit its specialized capabilities for optimal image processing. Our method follows a standard ViT-projector-LLM framework, using separate MLP projectors for each vision encoder branch. To further optimize the visual token representation, LEO employs pixel unshuffling to reduce the number of tokens for both encoders. These visual tokens are then sequentially interleaved and merged with text tokens before being fed into the LLM. By employing a tile-level post-adaptation fusion strategy for visual token combination, LEO not only outperforms models with single vision encoders and those that rely on higher image resolutions, but also demonstrates superior performance over existing hybrid MLLMs with pre-adaptation fusion techniques. This is demonstrated by our quantitative analysis, illustrated in Fig. 1. Our contributions are summarized as follows:

  • We propose LEO, a powerful hybrid multimodal model that strategically fuses the capabilities of two visual encoders through a tile-level post-adaptation fusion methodology to enhance the visual understanding capabilities of MLLMs.

  • We conduct extensive experiments across multiple general benchmarks, thus demonstrating the superior performance of LEO on the majority of tasks when compared to leading open-source single-vision-encoder and hybrid models.

  • We adapt LEO to the specialized domain of autonomous driving (AD) without altering its architecture, collecting extensive domain-specific pretraining data, or customizing the training recipe. To the best of our knowledge, this is the first exploration of hybrid MLLMs for AD.

2 Related work

2.1 Multimodal LLMs

With the rapid advancement of large language models [46, 1, 43, 8], there has been considerable interest in multimodal models that enhance understanding and reasoning capabilities. BLIP-2 [21] introduces the Q-Former, designed to efficiently bridge the modality gap between images and text during pretraining. Flamingo [2] is capable of processing sequences that combine visual and textual data in any arbitrary order, a crucial feature that enables it to perform in-context few-shot learning effectively. LLaMA-Adapter v2 [12] activates a larger number of learnable parameters in the LLM through an early fusion strategy for vision and text tokens, where vision tokens are fed only into the initial layers of the language model, enabling more efficient integration of visual information. LLaVA [29, 27] delivers impressive results by employing a straightforward projector, such as a linear layer or MLP, between the visual and language components, and developing a streamlined instruction-following data process powered by GPT. Most of these methods, however, maintain low input resolutions due to the constraints of pretrained vision encoders, such as that of CLIP [38], and the sequence length restrictions of large language models. All these approaches exploit a single vision encoder in their architecture.

2.2 Vision encoders for MLLMs

To address the constraints of lower input resolutions, recent MLLMs have concentrated on enhancing their vision encoder module. From a vision-focused standpoint, these approaches can be broadly categorized into three main strategies: (1) robust vision encoders: designing stronger vision encoders [50, 7] that effectively capture complex visual features by incorporating larger model sizes and better training strategies, (2) tile segmentation: handling high-resolution inputs by dividing images into smaller, lower-resolution tiles [39, 28, 6], and (3) hybrid vision encoders: utilizing a vision backbone that incorporates multiple vision experts.

Our research is closely tied to the third group of approaches [23, 45, 31] that develop a multi-branch vision encoder framework to enhance perception capabilities of MLLMs. Some models [33, 23] suggest merging high-resolution visual details with low-resolution tokens to enhance visual representation. LLaVA-HR [33] proposes a dual-pathway vision model that integrates features from high-resolution convolutional blocks with those from low-resolution ViT blocks. These pretrained vision experts might nevertheless lack key capabilities, such as text understanding and object localization. Several studies have incorporated multiple vision experts trained on diverse tasks to broaden the functionality of their encoders. Brave [16] and Mousi [11] perform sequence concatenation, combining vision tokens from multiple experts into a single, extended sequence. Tong et al. [45] identify distinct differences in the visual features captured by CLIP [38] and DINOv2 [37], leading to the design of an image-level mixture-of-features strategy. Some models combine vision tokens through channel concatenation to preserve the original sequence length, such as DeepSeek-VL [31] and Eagle [40] models, or utilize advanced fusion and routing techniques [19, 51] to leverage the strengths of various encoders. These methods perform pre-adaptation fusion of vision tokens, which are then processed by a single projector module (either an MLP or Q-Former), as shown in Fig. 2.

Distinguished from previous studies, this work proposes a novel hybrid approach that integrates a tile-level post-adaptation fusion strategy with dynamic high-resolution inputs achieved through tile segmentation. In our framework, each vision expert is equipped with its own projector, and the vision tokens are sequentially interleaved following vision-text alignment by the projectors at the tile level.

3 Method

Figure 3: The architecture of our model. LEO adopts a dual-vision MLLM architecture with tile-level post-adaptation fusion of visual tokens. Pixel unshuffling is applied to reduce the number of visual tokens.

3.1 Overview

Figure 3 illustrates the overall architecture of our proposed multimodal large language model, LEO, which is conceptually straightforward: First, the high-resolution input images are divided into tiles. These tiles are then processed by two different vision experts, each exploiting its specialized pretraining to provide distinct feature representations. Next, pixel unshuffling is applied to the extracted visual embeddings to reduce the number of visual tokens for each vision encoder. Vision-text alignment is conducted using two distinct MLP projectors, and a tile-level post-adaptation fusion strategy is employed to sequentially interleave tile-level visual tokens. Finally, these visual tokens are combined with text tokens and processed by the LLM for comprehensive visual-language understanding and reasoning. The following section gives more details of the architectural blocks identified in Fig. 3.

3.2 LEO

Dynamic high resolution. The image processing begins with a dynamic high-resolution approach, where an input image is segmented into multiple tiles alongside a thumbnail of the original image, helping the model to capture the global context. We utilize a dynamic resolution input strategy similar to that in InternVL [7, 6]. Each input image is resized to the closest available aspect ratio that is divisible into square tiles of size 448×448, with up to six tiles for the 3:2 aspect ratio shown in Fig. 4. To capture global context, the input image is also scaled to match the tile size and added to the set of tiles to be processed by the vision encoders. This approach, when applied dynamically to two input images, results in a total of 14 unique tiles for training. Figure 4 illustrates this tiling process with an example driving scene image, where the tiles are shown after normalization by the SAM preprocessor [18].
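As a rough illustration of this tiling scheme, the sketch below enumerates candidate tile grids, picks the grid whose aspect ratio best matches the input image, splits the resized image into 448×448 tiles, and appends a thumbnail for global context. The grid-selection heuristic is our assumption, modeled loosely on InternVL-style dynamic resolution; LEO's exact matching rule may differ.

```python
from PIL import Image

TILE = 448
MAX_TILES = 6  # per-image tile budget stated in the paper

def candidate_grids(max_tiles=MAX_TILES):
    # All (cols, rows) layouts whose tile count stays within the budget.
    return [(c, r) for c in range(1, max_tiles + 1)
                   for r in range(1, max_tiles + 1) if c * r <= max_tiles]

def pick_grid(width, height, max_tiles=MAX_TILES):
    # Choose the layout whose aspect ratio is closest to the input image's.
    target = width / height
    return min(candidate_grids(max_tiles), key=lambda g: abs(g[0] / g[1] - target))

def tile_image(img: Image.Image):
    cols, rows = pick_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    tiles.append(img.resize((TILE, TILE)))  # thumbnail preserving global context
    return tiles
```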

Figure 4: Tile Segmentation: Each input image is divided into multiple tiles to capture localized details, while a resized version maintains global context. The tiles are shown after preprocessing with the SAM [18] preprocessor.

Pixel unshuffling. Once the input images are segmented into tiles, they are simultaneously fed to two distinct vision encoders. We select InternViT-300M-448px [7] as the first vision encoder and SAM-L [18] as the second. For each encoder, we apply pixel unshuffling [41], which rearranges the spatial layout of pixels to reduce the number of visual tokens while preserving important visual features. Given an input tensor of shape [b, c, w, h] and a downscaling factor of r, the output tensor has the shape [b, c·r², w/r, h/r], where b represents the number of tiles, c is the number of channels, and w and h denote width and height. Specifically, for each segmented tile, we apply downscaling factors of 2 and 4 for our first and second vision encoders, respectively, reducing the number of visual tokens to 256 per tile and encoder, resulting in a total of 512 visual tokens per tile.
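To make the token arithmetic concrete, the hedged sketch below applies PyTorch's pixel unshuffle to per-tile feature grids using the downscaling factors above. The feature-grid sizes and channel widths are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn.functional as F

def reduce_tokens(feat_grid: torch.Tensor, r: int) -> torch.Tensor:
    """[tiles, c, h, w] -> [tiles, (h/r)*(w/r), c*r*r] token sequences."""
    x = F.pixel_unshuffle(feat_grid, downscale_factor=r)  # [tiles, c*r^2, h/r, w/r]
    return x.flatten(2).transpose(1, 2)                   # spatial positions become tokens

# Illustrative per-tile feature grids for one image split into 6 tiles
# (channel widths of 1024 and 256 are assumptions).
internvit_feats = torch.randn(6, 1024, 32, 32)   # 32x32 feature grid per 448-px tile
sam_feats = torch.randn(6, 256, 64, 64)          # 64x64 feature grid per 448-px tile
tokens_a = reduce_tokens(internvit_feats, r=2)   # [6, 256, 4096]: 256 tokens per tile
tokens_b = reduce_tokens(sam_feats, r=4)         # [6, 256, 4096]: 256 tokens per tile
```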

Visual token fusion. Existing hybrid multimodal models [33, 23, 16, 40] typically use a pre-adaptation fusion strategy to combine visual tokens from two or more visual encoders, applying one of the common fusion methods shown in Fig. 2 (bottom) prior to the vision-text alignment process. In these approaches, all visual encoders share the same projector module. In contrast, LEO employs an alternative fusion approach in which each vision encoder maintains its own dedicated projector module, allowing for independent processing of visual tokens before they are combined. We find this to be a more flexible and effective fusion strategy. We use a two-layer MLP for the projector architecture to ensure simplicity and efficiency. To streamline processing, we set the output dimension of the projector to match that of our LLM. In the fusion block, we sequentially interleave the visual tokens from InternViT and SAM-L for each segmented tile, while preserving their original order.
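The snippet below is a minimal sketch of one plausible reading of this fusion block: each encoder owns its own two-layer MLP projector, and the projected token sets are interleaved tile by tile (first encoder, then second) before being flattened into a single visual sequence. The hidden dimensions and the GELU activation are assumptions.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Two-layer MLP mapping encoder features into the LLM embedding space."""
    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, x):  # [tiles, tokens, in_dim] -> [tiles, tokens, llm_dim]
        return self.net(x)

def fuse_post_adaptation(tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
    """Interleave the projected token sets tile by tile, preserving tile order,
    and flatten them into one visual-token sequence."""
    fused = torch.cat([tokens_a, tokens_b], dim=1)  # [tiles, 2*tokens, llm_dim]
    return fused.reshape(-1, fused.shape[-1])       # [tiles * 2*tokens, llm_dim]

# Hypothetical dimensions: 256 tokens per tile per encoder, 4096-d LLM embeddings.
proj_a, proj_b = Projector(4096, 4096), Projector(4096, 4096)
visual_seq = fuse_post_adaptation(proj_a(torch.randn(7, 256, 4096)),
                                  proj_b(torch.randn(7, 256, 4096)))  # [7 * 512, 4096]
```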

Language model. We combine the extracted visual and text tokens and feed them into the LLM for auto-regressive generation. For our language model, we use InternLM2-7B-Chat [4], identical to the LLM in the base model [7], to ensure a fair comparison. To optimize memory and computational efficiency, the context length of the LLM is set to a maximum of 8196 tokens, ensuring balanced performance across various multimodal tasks. Given the input context length constraint of the LLM, we limit each input image to six segmented tiles, which enables efficient handling of multiple images.

3.3 Training process

Our training process consists of two main phases. In the first phase, we initialize the vision encoders and the language model from the base models [7, 4, 18], while the projector layers for SAM are randomly initialized. This approach is adopted because InternViT is already pretrained for vision-language alignment, whereas SAM is pretrained specifically for segmentation. To avoid potential representation inconsistencies, we focus on training the layers of the second projector. In the second phase, the vision encoders are kept frozen, and we perform full fine-tuning of the two projectors and the language model. In Table 5, we present the results of an ablation study that shows the effect of keeping the vision encoders frozen during the training phases.

3.4 Adaptation to autonomous driving

Although numerous studies have successfully applied MLLMs to autonomous driving [34, 5, 44, 47], a straightforward approach that avoids extensive modifications to model architecture, training processes, or heavy data collection has yet to be fully explored. In this work, we investigate the potential of applying LEO to the autonomous driving domain without altering its architecture or training recipe, aiming to offer insights into streamlined transfer learning and facilitate MLLM adaptation to specialized domains. Instruction tuning plays a crucial role in helping models learn to follow user prompts, utilizing training data in visual question answering and conversational formats. For this domain, we design tasks in a VQA format, with each frame represented as: <img> <IMG-CONTEXT> </img>. At the prompt level, the temporal aspect of video frames is managed by treating sequential frames as multiple image inputs. A sample prompt is formulated as: “<image1><image N> Is it safe to enter the intersection at this time?”. The images in this setting are high-resolution, each measuring 2048×1280 and segmented into six tiles of size 448×448.
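A minimal sketch of how such a multi-frame VQA sample could be assembled is shown below; the frame wrapper follows the template above, while the file names, field names, and join order are assumptions for illustration.

```python
# Frame wrapper quoted in the paper; everything else here is an illustrative assumption.
IMG_WRAP = "<img><IMG-CONTEXT></img>"

def build_driving_sample(frame_paths, question):
    # Sequential video frames are treated as multiple image inputs at the prompt level;
    # each frame is later tiled into six 448x448 tiles plus a thumbnail (Sec. 3.2).
    frame_refs = "".join(f"<image{i + 1}>" for i in range(len(frame_paths)))
    return {
        "images": frame_paths,
        "image_template": [IMG_WRAP] * len(frame_paths),
        "prompt": f"{frame_refs} {question}",
    }

sample = build_driving_sample(
    ["frame_000.jpg", "frame_001.jpg"],  # hypothetical paths to two camera frames
    "Is it safe to enter the intersection at this time?",
)
# sample["prompt"] == "<image1><image2> Is it safe to enter the intersection at this time?"
```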

Model ChartQA DocVQA VQA-T GQA VQA-v2 VizWiz MMB MMMU POPE AI2D SEED SQA MMVet
Instruct-BLIP [9] - - 50.1 49.2 - 45.5 36 - - - - 60.5 26.2
InternVL [7] - - 57.0 62.9 79.3 52.5 64.6 - 86.4 - 65.4 66.2 31.2
VILA [25] - - 64.4 62.5 79.9 57.8 68.9 - 85.5 - 61.1 68.2 34.9
QwenVL [3] 65.7 65.1 63.8 59.3 79.5 - 38.2 - - 62.3 64.8 67.1 -
QwenVL-Chat [3] 66.3 62.5 61.5 57.5 78.2 - - - - 57.7 - 68.2 -
LLaVA-1.5 [27] - - 58.2 62.0 78.5 50.0 64.3 - - - - 66.8 31.1
LLaVA-Next [28] - - - - - - 67.4 35.8 86.5 - 70.2 - -
LEO (ours) 71.0 80.1 68.8 64.8 78.3 57.9 72.9 36.4 88.0 69.6 72.2 78.5 37.2
Table 1: Comparison with leading MLLMs across 13 benchmarks. All models use a 7B language model. Bolded values indicate the best performance. Some benchmark names are abbreviated due to space constraints: VQA-T: TextVQA [42], SQA: ScienceQA [32], MMB: MMBench [30], and SEED: SEED-Bench [20].
Model Fusion PT SFT VQA-T GQA VQA-v2 VizWiz MMB MMMU-v MMMU-t POPE SEED SQA MMVet
Brave-X5 [16] Pre-A 100 M - - 52.7 82.5 54.2 - - - 87.6 - - -
LLaVA-HR [33] Pre-A 558 K 1.2 M 67.1 64.2 81.9 48.7 - - - 87.6 64.2 65.1 31.2
Mini-Gemini [23] Pre-A 1.2 M 1.5 M 65.2 64.5 - - 69.3 36.1 32.8 - - 71.1 40.8
Mousi [11] Pre-A 1.2 M 1.6 M 53.4 60.5 75.4 - 65.4 - - 85.4 62.0 71.6 29.1
LEO (ours) Post-A 595 K 1 M 68.8 64.8 78.3 57.9 72.9 36.4 33.5 88.0 72.2 78.5 37.2
Table 2: Results on 11 evaluation benchmarks are compared with leading hybrid MLLMs. All models use a 7B language model. The best values are shown in bold. X5 denotes a mixture of 5 vision encoders. The following names are shortened due to space constraints: Pre-A: pre-adaptation, Post-A: post-adaptation, PT: pretraining data, SFT: supervised finetuning data, VQA-T: TextVQA [42], SQA: ScienceQA [32], MMB: MMBench [30], and SEED: SEED-Bench [20].

4 Experiments

We first describe the evaluation setting, outlining the implementation details of our model. We then present a comparative analysis of LEO against leading open-source MLLMs and hybrid models, across diverse vision-language tasks.

4.1 Setting

Implementation details. Our model architecture is built upon InternVL [7], following a standard MLLM design of ViT-projector-LLM. The projector aligns the visual features by mapping them into the language embedding space. LEO uses InternLM2-7B-Chat [4] as the large language model, with pretrained InternViT-300M-448px [7] and SAM-L [18] as vision encoders, and an adaptive tiling strategy. Following the base model [7], we divide each input image into 448×448 tiles, where the number of tiles varies based on the aspect ratio of the image. Our training procedure consists of two stages. In the first stage, we freeze both vision encoders and focus on optimizing the projector module to ensure effective training. In the second stage, we perform supervised instruction tuning, unfreezing the projector modules along with the LLM. Both stages employ a context length of 8196, and training is conducted for a single epoch. We optimize the model using the AdamW optimizer with a cosine learning rate schedule. During the second stage, we set the learning rate to 4×10⁻⁵ with a weight decay of 0.01. In the alignment stage, we increase the learning rate to 4×10⁻⁴, maintaining the same weight decay. Training was conducted on 8 A100 GPUs (80 GB each) using DeepSpeed’s ZeRO-2 strategy, allowing training to complete within approximately 72 hours.
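Under those hyperparameters, a minimal optimizer setup might look like the sketch below; `model`, `stage`, and `num_steps` are placeholders, and warmup as well as the DeepSpeed ZeRO-2 wrapping are omitted.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def configure_training(model: torch.nn.Module, stage: str, num_steps: int):
    # Peak learning rate depends on the stage: 4e-4 for alignment, 4e-5 for SFT.
    lr = 4e-4 if stage == "alignment" else 4e-5
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = CosineAnnealingLR(optimizer, T_max=num_steps)  # cosine decay over training
    return optimizer, scheduler
```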

Training Datasets. In the first stage of training, we use the LLaVA-595k dataset [28], which comprises 595k samples. In the supervised fine-tuning stage, we employ the same dataset as InternVL [7], incorporating a total of approximately one million visual instruction tuning samples, all of which are fully open-source.

4.2 Main Results on general benchmarks

Comparison with leading MLLMs. In this section, we comprehensively evaluate our model’s visual understanding and reasoning abilities in comparison to previous leading MLLMs, across 13 vision-language benchmarks. These benchmarks are organized into three task categories: (1) OCR and chart understanding, including DocVQA [36], TextVQA [42], ChartQA [35], and AI2D [17]; (2) general visual question answering, including VQA-v2 [13], GQA [15], and VizWiz [14]; and (3) general multimodal benchmarks, such as MMMU [49], MMBench [30], SEED-Bench [20], POPE [22], MMVet [48], and ScienceQA [32].

In Table 1, we see that LEO achieves state-of-the-art results in 12 out of 13 benchmarks. In the OCR and chart understanding category, LEO consistently surpasses leading models across all four datasets, by virtue of its dual-branch vision encoders. In the multimodal benchmark category, LEO demonstrates superior performance across all six benchmarks, highlighting its broad knowledge and advanced reasoning abilities. Additionally, the results in Table 1 show that LEO excels in more demanding benchmarks that necessitate college-level knowledge, such as MMMU [49], which focuses on complex problems from various domains, including science, business, tech and engineering, as well as health and medicine. Notably, compared to InternVL [7], which uses the same LLM and vision encoder as our model, LEO achieves superior performance across 8 out of 9 benchmarks, demonstrating the benefits of dual-branch vision encoders for vision-language tasks. This approach mitigates the inherent biases of individual vision encoders, providing a robust framework for the mixture of encoders.

Comparison with leading Hybrid MLLMs. We compare the performance of LEO with recent hybrid approaches across 11 benchmarks. In Table 2, we see that LEO demonstrates strong performance on the majority of benchmarks. Our model is trained with the least SFT data and nearly the least pretraining data, yet it outperforms models trained on larger datasets, such as Mousi [11], highlighting the generalization capability of our model. Compared to models with more complex fusion strategies, such as LLaVA-HR [33] and Mini-Gemini [23], our model excels across most benchmarks, especially on multimodal benchmarks like MMBench, SEED, and ScienceQA. Notably, compared to Brave [16], which combines five distinct vision encoders through pre-adaptation fusion and sequence concatenation, LEO achieves competitive performance on most tasks. This result underscores that post-adaptation fusion of visual tokens from only two vision experts can be as effective as pre-adaptation fusion of visual tokens from a larger set of vision experts.

Comparison with Eagle. We conduct a comparison with Eagle [40], a concurrent work that integrates vision encoders through pre-adaptation fusion and channel concatenation. Table 3 shows that LEO outperforms Eagle-X2 [40], which combines two vision experts, on 7 out of 9 benchmarks, particularly excelling in the OCR and general VQA categories. Notably, LEO also surpasses Eagle-X4 [40], which uses four vision encoders, on 5 out of 7 benchmarks, with an identical score for GQA and a near-identical score on POPE. It is worth mentioning that these results are achieved despite LEO being trained with less SFT data, highlighting the robustness and reasoning capability of the enhanced fusion design in LEO.

Eagle-X4 [40] Eagle-X2 [40] LEO (ours)
Fusion Pre-adapt. Pre-adapt. Post-adapt.
#-Tokens 1024 1024 512
PT 595 K 595 K 595 K
SFT 1.8 M 1.8 M 1 M
ChartQA 67.5 67.0 71.0
DocVQA - 77.7 80.1
VizWiz 50.8 48 57.9
GQA 64.8 63.2 64.8
MMMU - 36.0 36.4
SEED 73.4 73.5 72.2
MMBench 67.8 - 72.9
POPE 88.4 88.3 88.0
ScienceQA 70.4 70.7 78.5
Table 3: Results compared to a concurrent approach [40], which combines vision encoders through pre-adaptation fusion and channel concatenation. Here, XN denotes a mixture of N vision encoders and #-Tokens denotes the number of visual tokens.
Model N Lingo-J ↑ BLEU ↑ METEOR ↑ CIDEr ↑
BLIP-2 [21] 1 52.20 13.00 17.40 60.10
LLaVA-1.5 [27] 5 51.00 10.62 29.44 48.18
InternVL [7] 5 58.00 13.53 34.27 67.17
LingoQA [34] 3 59.80 14.61 18.44 62.61
LingoQA [34] 5 60.80 15.00 18.56 65.62
LEO (ours) 2 61.00 14.91 35.44 69.72
Table 4: Results on the LingoQA benchmark [34]. All models are fine-tuned, where N denotes the number of frames used during training. Lingo-J represents the Lingo-Judge metric. LEO demonstrates competitive performance without requiring a model architecture tailored to the autonomous driving domain.

4.3 Results in autonomous driving domain

Settings. We use the LingoQA benchmark [34] to evaluate the performance of LEO in the autonomous driving domain. LingoQA contains over 400K samples and covers various aspects of the driving process. This dataset includes data covering nine distinct task types, such as action, justification, localization, and anticipation, providing a thorough representation of scenarios encountered in autonomous driving.

InternViT-300M SAM-ViT-Large Freeze VQA-T GQA VizWiz MMB POPE SEED SQA MMVet
✓ ✗ ✗ 57.0 62.9 52.5 64.6 86.4 65.4 66.2 31.2
✗ ✓ ✗ 45.2 56.4 47.5 44.7 84.2 51.3 64.0 18.2
✗ ✓ ✓ 49.5 58.2 50.6 48.3 85.4 54.7 65.2 19.8
✓ ✓ ✗ 67.2 63.1 55.7 71.0 87.6 69.6 75.8 35.0
✓ ✓ ✓ 68.8 64.8 57.9 72.9 88.0 72.2 78.5 37.2
Table 5: Ablation study on various training settings.
Benchmark Sequence Concat. Channel Concat.
VQA-T 68.8 67.3
GQA 64.8 62.8
VizWiz 57.9 54.3
MMB 72.9 70.9
POPE 88.0 87.6
SEED 72.2 72.0
SQA 78.5 78.4
MMVet 37.2 35.7
Table 6: Comparison of fusion methods in LEO.
Model #-Frame Lingo-J ↑ BLEU ↑ METEOR ↑ CIDEr ↑
Tiling 2 61.00 14.91 35.44 69.72
No tiling 2 59.02 13.78 34.32 65.12
Table 7: Ablation study on tiling, evaluated on the LingoQA benchmark [34].

Data format. In LingoQA, images are captured from a front-view camera with sequences of five frames. Due to computational limitations, we use only two frames during training. For this evaluation, we maintain the same pretraining data as in general training and use the LingoQA Scenery and Action training dataset [34] for the second stage. The training process remains as described in Section 4.1. Additionally, we reformat the LingoQA data into the standard conversational format described in Section 3.4.

Results. We evaluate our model on the LingoQA validation set [34], with results presented in Table 4. Our model demonstrates competitive performance against the closed-source LingoQA baseline [34], which is pretrained on over 22M data samples, significantly outperforming it on the METEOR and CIDEr metrics. Without modifying its architecture or training recipe, LEO also surpasses all existing open-source baselines across all four metrics. Notably, LEO achieves higher scores than the base model [7], highlighting the effectiveness of its dual-branch design.

4.4 Ablation studies

Comparison with different training settings. To more effectively analyze the impact of training strategies for vision encoders in multimodal large language models, we conduct an ablation study on the vision encoder modules in our model (SAM-ViT-Large [18] and InternViT-300M [7]). This also provides insights into the contributions of each encoder to the model effectiveness. Each vision encoder processes 256 visual tokens. Results in Table 5 highlight three key findings. First, keeping the vision encoders frozen during training improves evaluation scores; for instance, unfreezing SAM reduces SEED performance by 6.62%. Second, InternViT alone performs better than SAM alone across all benchmarks, with SAM struggling on tasks like text recognition. This is likely due to InternViT’s large-scale pretraining, although when combined with InternViT, SAM enhances performance on this type of task. Finally, regardless of whether the SAM vision encoder is frozen, a hybrid MLLM consistently outperforms models with a single vision encoder.

Effect of fusion method. Sequence and channel concatenation are two primary approaches for fusing visual tokens in hybrid MLLMs. To investigate the impact of these fusion strategies on the performance of LEO, we conduct an experiment, with results presented in Table 6. Our findings reveal that sequence concatenation consistently outperforms channel concatenation across all benchmarks, highlighting its effectiveness in enhancing model performance. It is worth noting that these results are specific to post-adaptation fusion. For a broader comparison, refer to the results presented in Table 2 and Table 3, where our model performance is compared with models employing pre-adaptation fusion through four different methods: Brave [16] and Mousi [11] use sequence concatenation, LLaVA-HR [33] employs an MR-adapter, Mini-Gemini [23] uses cross-attention, and Eagle [40] utilizes channel concatenation.

Effect of tiling. To investigate the effect of tiling, we train LEO with and without tiling using two frames. This yields 14 tiles with tiling. The performances of these models applied to the LingoQA benchmark [34] are given in Table 7. We see that tiling improves performance across all four metrics; for instance, it enhances Lingo-Judge by 3.4%. These results confirm that incorporating dynamic high-resolution inputs can enhance the model capacity for understanding driving scenes.

Figure 5: Qualitative results of LEO’s enhanced visual understanding on various vision-language tasks. Some images are taken from the following benchmarks: MMVet [48], MMMU [49], TextVQA [42], and LingoQA [34].

4.5 Visualization

To highlight the visual understanding capabilities of LEO, we conduct a qualitative analysis as shown in Fig. 5. Our model is applied to a variety of vision-language tasks, including complex reasoning, detailed counting, OCR, spatial and mathematical reasoning, accounting analysis, and multi-image and multi-frame reasoning. With an efficient tile-level post-adaptation fusion strategy, LEO exhibits impressive performance across these challenging tasks. For example, our model can perform attribute-based counting, such as identifying the absence of parked cars while there are several moving vehicles in the driving scene. Beyond simple recognition, LEO demonstrates spatial awareness, enabling it to answer OCR-related questions like, “What is located to the right of the shampoo?”. In multi-image reasoning, it accurately identifies differences in detail between images, such as the dog’s head being in different positions. LEO also demonstrates strong capabilities in multi-frame reasoning in the autonomous driving domain, including recognizing safe actions in dynamic scenes, such as stopping to allow a pedestrian to cross. Additionally, LEO excels in OCR tasks, effectively interpreting dense text, and handles complex mathematical and accounting problems, showcasing its strong reasoning abilities.

4.6 Limitation

The processing capacity of our model for input images is limited to a maximum of six patches (i.e., excluding the global context) due to the constraints of the language model context length and available computational resources. This restriction prevents support for higher-resolution images or a larger number of multi-image inputs.

5 Conclusion

In this work, we have introduced LEO, a powerful framework for multimodal large language models, whose core lies in a strategic design for hybrid multimodal models that enhances performance through a tailored combination of post-adaptation fusion and tile segmentation. We have also demonstrated that LEO can be easily extended to the specialized domain of autonomous driving without the need for extensive domain-specific adjustments. Comprehensive experiments on various zero-shot benchmarks demonstrate LEO’s effectiveness, which surpasses previous state-of-the-art models on the majority of tasks. We hope LEO serves as a foundation for advancing hybrid multimodal models and provides straightforward inspiration for adapting MLLMs to specialized domains.

References

  • Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
  • Bai et al. [2023] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  • Cai et al. [2024] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.
  • Cao et al. [2024] Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21819–21830, 2024.
  • Chen et al. [2024a] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024a.
  • Chen et al. [2024b] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024b.
  • Chiang et al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna, 3(5), 2023.
  • Dai et al. [2023] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  • Driess et al. [2023] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
  • Fan et al. [2024] Xiaoran Fan, Tao Ji, Changhao Jiang, Shuo Li, Senjie Jin, Sirui Song, Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng, et al. Mousi: Poly-visual-expert vision-language models. arXiv preprint arXiv:2401.17221, 2024.
  • Gao et al. [2023] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
  • Goyal et al. [2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  • Gurari et al. [2018] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
  • Hudson and Manning [2019] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
  • Kar et al. [2024] Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, and Federico Tombari. Brave: Broadening the visual encoding of vision-language models. arXiv preprint arXiv:2404.07204, 2024.
  • Kembhavi et al. [2016] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 235–251. Springer, 2016.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • Lee et al. [2024] Byung-Kwan Lee, Beomchan Park, Chae Won Kim, and Yong Man Ro. Moai: Mixture of all intelligence for large language and vision models. arXiv preprint arXiv:2403.07508, 2024.
  • Li et al. [2023a] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023a.
  • Li et al. [2023b] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023b.
  • Li et al. [2023c] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023c.
  • Li et al. [2024a] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024a.
  • Li et al. [2024b] Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26763–26773, 2024b.
  • Lin et al. [2024] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024.
  • Lin et al. [2023] Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023.
  • Liu et al. [2024a] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024a.
  • Liu et al. [2024b] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024b.
  • Liu et al. [2024c] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024c.
  • Liu et al. [2025] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2025.
  • Lu et al. [2024] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
  • Lu et al. [2022] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
  • Luo et al. [2024] Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, and Rongrong Ji. Feast your eyes: Mixture-of-resolution adaptation for multimodal large language models. arXiv preprint arXiv:2403.03003, 2024.
  • Marcu et al. [2024] Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. arXiv preprint arXiv:2312.14115, 2024.
  • Masry et al. [2022] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
  • Mathew et al. [2021] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Shi et al. [2024a] Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, and Trevor Darrell. When do we not need larger vision models? arXiv preprint arXiv:2403.13043, 2024a.
  • Shi et al. [2024b] Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, et al. Eagle: Exploring the design space for multimodal llms with mixture of encoders. arXiv preprint arXiv:2408.15998, 2024b.
  • Shi et al. [2016] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016.
  • Singh et al. [2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
  • Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Tian et al. [2024] Ran Tian, Boyi Li, Xinshuo Weng, Yuxiao Chen, Edward Schmerling, Yue Wang, Boris Ivanovic, and Marco Pavone. Tokenize the world into object-level knowledge to address long-tail events in autonomous driving. arXiv preprint arXiv:2407.00959, 2024.
  • Tong et al. [2024] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.
  • Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Wang et al. [2024] Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning. arXiv preprint arXiv:2405.01533, 2024.
  • Yu et al. [2023] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
  • Yue et al. [2024] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
  • Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
  • Zong et al. [2024] Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, and Yu Liu. Mova: Adapting mixture of vision experts to multimodal context. arXiv preprint arXiv:2404.13046, 2024.