MinWoo(Daniel) Park | Tech Blog
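1 Introduction
Language modeling and image understanding have advanced markedly in recent years. The growth of large-scale image-text data and compute has produced Large Language Models (LLMs) and vision foundation models that address a wide range of language and image understanding problems. Building on these advances, Multimodal Large Language Models (MLLMs), which process image and text data together, have emerged.
2 Related Work
The MLLMs considered in this work build on a strong pre-trained autoregressive LLM that consumes both text and visual tokens. This approach uses a decoder-only architecture and processes visual data through an image encoder. Recent research has focused on visual instruction tuning on top of a pre-trained LLM.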
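3 Recipe for Building MM1
Building an MLLM is a highly empirical process. Ablations over a range of design decisions were run to arrive at a better-performing model.
3.1 Experimental Setup
To measure the effect of each design decision, ablations are run from a base configuration of the model. The results determine the final model-data configuration, which is then scaled up in model parameters and training time.
3.2 Model Architecture Ablations
This part analyzes the components that let an LLM process visual data. The main considerations are how the image encoder is pre-trained and how visual features are bridged to the LLM space.
[Image Encoder Pre-training]
A CLIP pre-trained image encoder was used. Ablations across image encoders show that image resolution and the encoder pre-training objective substantially affect downstream results.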
\[\text{Encoder Lesson:} \, \Delta \text{Performance} \approx 3\% \text{ for increased resolution; } <1\% \text{ for increased model size}\]
3.3 Pre-training Data Ablations
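Large-scale, task-appropriate data is essential for training a performant model. Image caption data, interleaved image-text documents, and text-only data were used.
The Importance of the Data Mixture
The types of data used in training and their mixing ratios strongly affect model performance. A suitable mix of image data and text data optimizes both multimodal and text-only performance.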
\[\text{Data Lesson 1:} \, \text{Interleaved data boosts few-shot performance; Caption data boosts zero-shot performance}\]
4 Final Model and Training Recipe
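The final recipe for MM1 multimodal pre-training is determined from the preceding ablation results. To improve performance, the LLM is scaled to 3B, 7B, and 30B parameters. All models are pre-trained on a large-scale data mixture, and the final MM1-30B model is built with a peak learning rate chosen through an efficient search at small scale.
Model Scaling and MoE
A Mixture-of-Experts (MoE) architecture scales the total number of model parameters while largely preserving inference speed. Two MoE models were designed and trained: a 3B-MoE with 64 experts and a 7B-MoE with 32 experts.
Supervised Fine-Tuning (SFT)
SFT is an important step for improving MM1, making the pre-trained model usable across a variety of real applications. The SFT data mixture, high-resolution support, and an unfrozen backbone are reported to help the model perform well across a variety of scenarios.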
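Method | Description |
---|---|
Data mixture | A mix of diverse datasets is used for training |
Unfrozen image encoder & LLM | The image encoder and LLM backbone are kept unfrozen during SFT |
High-resolution support | High resolution is supported through positional embedding interpolation and sub-image decomposition |
Evaluation | Models are evaluated on 12 benchmark datasets |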
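Experimental Results and Evaluation
Through a systematic approach and empirical validation, the models achieve competitive performance on a range of multimodal benchmarks.
5 Conclusion
The work focuses on building performant Multimodal Large Language Models (MLLMs). Careful ablations of modeling decisions and data strategies yield important lessons that lead to a pre-trained model achieving state-of-the-art results on a range of few-shot evaluations. After supervised fine-tuning (SFT), the model delivers competitive performance on a wide range of benchmarks while enabling multi-image reasoning and few-shot prompting. The hope is that these findings help the research community build strong models beyond the limits of any single architecture or data strategy.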
In recent years, the research community has achieved impressive progress in language modeling and image understanding. Thanks to the availability of large-scale image-text data and compute at scale, we have seen the emergence of highly performant Large Language Models (LLMs) [9, 10, 19, 21, 26, 92, 93, 102, 107, 109, 116, 131] and Vision Foundation Models [40, 88, 91] that have become the de facto standard for the majority of language and image understanding problems.
Fig. 1: MM1 can perform in-context predictions thanks to its large-scale multimodal pre-training. This allows MM1 to (a) count objects and follow custom formatting, (b) refer to parts of the images and perform OCR, (c) demonstrate common-sense and word knowledge about everyday objects, and (d) perform basic math functions. Images are from the COCO 2014 validation set [72].
Given the above developments, an area of multimodal foundation models has emerged that marries the above advances into a single model achieving superior capabilities. In particular, Multimodal Large Language Models (MLLMs) are large-scale foundation models that consume image and text data and produce text [28, 67, 79, 110]. After the rise of LLMs, MLLMs are emerging as the next frontier in foundation models.
When it comes to transparency, existing MLLMs fall into two categories: closed models [1, 106] and open models [3–5, 77, 90]. In the former category, the models might be available for use, but little to nothing is known about the data, model architecture, and training details. In the latter category, the model parameters might be released together with a detailed description of data, model, and training configurations, thus allowing the community to build upon. However, most of the works, both open and closed, release close to nothing about the process they have undergone to arrive at their algorithmic design choices, especially regarding multimodal pre-training.
To further research in this area, we believe it is imperative to distill principles and lessons of how to build such models that might outlive concrete component implementations. Thus, in this paper, we document the MLLM building process and attempt to formulate design lessons that we hope are of use to the community.
In particular, our contributions are as follows. First, we perform ablations at small scale across (1) model architecture decisions and (2) pre-training data choices. We identify several interesting trends. On the modeling side, we see that design aspects are in the following order of importance: image resolution, visual encoder loss and capacity, and visual encoder pre-training data. Surprisingly, though, we find little evidence that architectural decisions of how visual data is fed into the LLM matter.
Fig. 2: MM1 can follow instructions and reason across images. Example and images from VILA [71]; VILA answers correctly when prompted with chain-of-thought.
Further, we use three different types of multimodal pre-training data: image-caption, interleaved image-text, and text-only data. We see that when it comes to few-shot and text-only performance, interleaved and text-only training data is of paramount importance, while for zero-shot performance, caption data matters most. We demonstrate that these trends hold after Supervised Fine-Tuning (SFT), both on the evaluations used in the pre-training as well as on further benchmarks. This shows that capabilities and modeling decisions discovered during pre-training are retained after fine-tuning.
Finally, we scale up our model by using larger LLMs, from 3B, 7B, to 30B, and by exploring mixture-of-experts (MoE) models, from 3B with 64 experts to 7B with 32 experts. This leads to a family of performant models that outperforms most of the relevant works to the best of our knowledge. In particular, the pre-trained model MM1 is SOTA, performing better than Emu2 [105], Flamingo [3], and IDEFICS [47] on captioning and visual question answering (VQA) tasks in few-shot settings, both in small and large size regimes. The final models, after SFT, achieve competitive performance across 12 established multimodal benchmarks.
Thanks to large-scale multimodal pre-training, as shown in Figures 1 and 2, MM1 enjoys appealing properties such as in-context predictions, multi-image and chain-of-thought reasoning. MM1 also enables strong few-shot learning capability after instruction tuning. These strong results demonstrate that the presented recipe for building MLLMs translates the design principles to a competitive model at scale. We hope that these presented insights will remain relevant, even as specific modeling components and data sources evolve.
The type of MLLMs concerned in this work builds upon a strong pre-trained autoregressive LLM that consumes both text and visual tokens, the latter obtained via an image encoder [5, 17, 28, 45, 64, 76, 90]. Our approach is based on a decoder-only architecture, akin to Kosmos-1 [45].
Recent research has increasingly focused on visual instruction tuning on top of the pre-trained LLM [63]. Prominent examples include LLaVA(-1.5/NeXT) [74–76], MiniGPT-4 [134], mPLUG-Owl(-2/Doc) [123–125], Otter [60, 61], InstructBLIP [24], Honeybee [12], SPHINX(-X) [36, 73], to name a few. There is also a rich body of literature on constructing instruction-tuning data [15, 37, 66, 113, 132], enabling MLLMs for referring and grounding [14, 56, 90, 115, 126, 130], and image generation and editing [34, 54, 105].
The body of work that focuses on thorough ablations, in particular also on the pre-training side, is relatively sparse. VILA [71] focuses on studying various components of multimodal pre-training, but falls short of providing optimization details or detailed pre-training evaluations. Emu2 [105], on the other hand, provides details regarding pre-training optimization parameters and base model results. However, they do not provide ablations that justify the various component decisions. IDEFICS [58] is another work that provides details regarding large-scale multimodal pre-training. However, their focus is primarily on closely replicating the closed-source Flamingo [3] model.
In contrast to these previous works, we aim to provide details regarding all components of our pre-training strategy, from hyperparameters to data to architecture. We also provide results for our base pre-trained models to help differentiate the impact of multimodal pre-training vs. instruction tuning. Furthermore, we provide extensive ablations on the precise impacts of decisions regarding visual encoders, vision-language connectors, and pre-training data mixture.
Building performant MLLMs is a highly empirical endeavor. Although the high-level architectural design and training procedure are clear, their concrete form and execution are not. In this work, we present details of the ablations we have performed to arrive at a performant model. We explore three major axes of design decisions:
In order to identify what are good choices along each of the above axes, we need an efficient way to assess model performance. As training a large MLLM can take substantial resources, we utilize a simplified setup for ablations.
Fig. 3: Left: Model ablations: what visual encoder to use, how to feed rich visual data, and how to connect the visual representation to the LLM. Right: Data ablations: type of data, and their mixture.
More concretely, we use a smaller base configuration of our model that we ablate from. We modify one component at a time, either an architectural module or a data source, and assess the impact of the design choice for each of these components. This allows us to arrive at the final model-data configuration that we scale up, both in terms of model parameters as well as training time. The base configuration for ablations is as follows:
To evaluate the different design decisions, we use zero-shot and few-shot (4 and 8-shot) performance on a variety of captioning and VQA tasks: COCO Captioning [18], NoCaps [2], TextCaps [103], VQAv2 [38], TextVQA [104], VizWiz [39], GQA [46], and OK-VQA [82].
In this work, we analyze components that enable an LLM to process visual data. Specifically, we investigate (1) how to best pre-train a visual encoder, and (2) how to bridge the visual features to the space of the LLM (see Figure 3, left).
Image Encoder Pre-training. Most MLLMs use a CLIP pre-trained image encoder [24, 74, 76, 124], while recent works also started to explore vision-only self-supervised models, such as DINOv2 [73, 108], as the image encoder. Similar to these prior works, we find that the choice of the pre-trained image encoder can substantially impact downstream results both after multimodal pre-training and after instruction tuning. Here, we primarily ablate the importance of image resolution and image encoder pre-training objective. Note that unlike the rest of our ablations, here we use a 2.9B LLM (instead of 1.2B) to ensure there is sufficient capacity to utilize some of the larger image encoders.
Table 1: MM1 pre-training ablation across different image encoders (with 2.9B LLM). Note that the values in the Data column correspond to the data that was used for the initial training of the image encoder itself, not MM1. Recon.: Reconstructive loss. AIM: [30]; DFN-2/5B: [31]; VeCap: VeCap-300M [57]; OpenAI [91].
Contrastive losses. When trained on large-scale image-text datasets, the resulting models possess strong semantic understanding of the image data as evidenced by performance on various forms of image classification and retrieval tasks [91]. These results were enabled because of the availability of large-scale image-text data, which can endow a visual encoder with semantic knowledge. More recently, automatically curated large-scale datasets and synthetic captions have led to even stronger encoders [31, 57].
Reconstructive Losses. When it comes to dense prediction, CLIP-style models struggle to attain the same strong performance [94, 95, 112]. This property can be problematic for MLLMs, as many of the tasks such as VQA and captioning require detailed image understanding. Hence, we also consider image encoders learned using reconstructive losses, as such losses explicitly capture all parts of an image. In particular, we utilize AIM [30], which has shown that a carefully designed autoregressive reconstructive loss on image data alone scales well.
Encoder Lesson: Image resolution has the highest impact, followed by model size and training data composition. As we can see in Table 1, increasing image resolution from 224 to 336 results in approx. 3% boost in all metrics across all architectures. Increasing the model size from ViT-L to ViT-H, a doubling in parameters, results in a modest performance increase of usually less than 1%. Finally, adding VeCap-300M [57], a dataset of synthetic captions, yields more than 1% boost in few-shot scenarios.
When it comes to model type, the results are less conclusive. Contrastive methods tend to result in higher performance than reconstructive ones. In particular, encoders based on ViT-L of 300M parameters result in a 0.3% to 1.5% performance gain compared to AIM-600M of comparable size (only 20 of the 24 AIM model layers are used at inference). This lesson is, nevertheless, inconclusive for the potential of AIM, as it has been trained on less than half the data. Similarly, the widely used open-source OpenAI model [91] performs on par with our model of comparable capacity but trained on the DFN+VeCap data mixture.
Fig. 4: 0-shot, 4-shot, and 8-shot ablations across different visual-language connectors for two image resolutions, and two image token sizes.
Vision-Language Connector and Image Resolution. The goal of this component is to translate the visual representation to the space of the LLM. As image encoders are ViTs, their output is either a single embedding, or a set of grid-arranged embeddings corresponding to the input image patches. Therefore, the spatial arrangement of the image tokens needs to be converted to the sequential one of the LLM. At the same time, the actual image token representations are to be mapped to the word embedding space.
While doing so, there are two conflicting requirements. On the one side, we would like to capture as much detail from the image as possible, fulfilled by increasing the number of image token embeddings. On the other side, especially in the case of multi-image input, having a large number of input tokens per image is computationally challenging.
We consider using 64 or 144 tokens to represent the image, as well as two different image resolutions, 224 and 336. Further, we consider the following architectural options:
Average Pooling. Following [105], we apply n×n average pooling on the output of the ViT image encoder, followed by a linear projection (n ∈ {8, 12}).
Attention Pooling. Motivated by the fact that image token representations are in a different space than the LLM input embeddings, attention pooling using k learnable queries is a natural approach. By varying k one can vary the number of inputs from a single image that are fed into the LLM (we use k ∈ {64, 144}).
Convolutional Mapping. More recently, Honeybee [12] has studied the above questions and proposed the C-Abstractor module. It is implemented as a ResNet [41] block that preserves local information, while adaptive pooling lets it change the number of image tokens.
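The first two connector options can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the MM1 implementation: the function names are hypothetical, a patch size of 14 (so a 336px image gives a 24×24 grid) is an assumption, n is read as the side of the pooled output grid (8×8 = 64 or 12×12 = 144 tokens), and the attention variant omits the key/value projections and the final linear projection into the LLM embedding space.

```python
import math
import random


def average_pool_tokens(grid, n):
    """Adaptive average pooling of a g x g grid of d-dim patch embeddings to n*n tokens."""
    g = len(grid)
    d = len(grid[0][0])
    tokens = []
    for i in range(n):
        r0, r1 = (i * g) // n, -((-(i + 1) * g) // n)  # floor / ceil bin edges
        for j in range(n):
            c0, c1 = (j * g) // n, -((-(j + 1) * g) // n)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            tokens.append([sum(v[k] for v in cells) / len(cells) for k in range(d)])
    return tokens  # row-major sequence, ready to be projected and fed to the LLM


def attention_pool_tokens(patches, queries):
    """Each of the k learnable queries attends over all patches (single head, no projections)."""
    d = len(queries[0])
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(d) for p in patches]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        pooled.append([sum(w * p[k] for w, p in zip(weights, patches)) / z for k in range(d)])
    return pooled


random.seed(0)
g, d = 24, 4  # 336 / 14 = 24 patches per side; d is tiny here only to keep the demo fast
grid = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(g)] for _ in range(g)]
avg_tokens = average_pool_tokens(grid, 12)
patches = [vec for row in grid for vec in row]
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(144)]  # stand-ins for learned queries
attn_tokens = attention_pool_tokens(patches, queries)
print(len(avg_tokens), len(attn_tokens))
```

Both paths reduce 576 patch embeddings to 144 tokens; they differ only in whether the reduction is a fixed spatial average or a learned, content-dependent weighting.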
VL Connector Lesson: Number of visual tokens and image resolution matter most, while the type of VL connector has little effect. The results shown in Figure 4 demonstrate that both zero- and few-shot performance increases as we increase the number of visual tokens and/or image resolution. However, contrary to what has been reported in the literature [12], different architectural designs do not appear to conclusively produce stronger models. After instruction tuning, all three architectures achieve very similar results at the 336px and 144 token setting. (See Appendix Figure 10 for fine-tuning results.)
Large-scale and task-appropriate data is of paramount importance in training performant models. Typically, models are trained in two stages, pre-training and instruction tuning. In the former stage web-scale data is used while in the latter stage task-specific curated data is utilized. In the following, we focus on the pre-training stage and elaborate our data choices (see Figure 3, right).
Table 2: List of datasets for pre-training multimodal large language models.
Two types of data are commonly used to train MLLMs: captioning data consisting of images with paired text descriptions; and interleaved image-text documents from the web (see Appendix A.1 for details). Note that captioning data tends to contain relatively short text with high relevance to the image. On the contrary, interleaved data has substantially longer and more diverse text with less relevance, on average, to the surrounding images. Finally, we include text-only data to help preserve the language understanding capabilities of the underlying pre-trained LLM. The full list of datasets is summarized in Table 2. We use the same model setup for ablations described in Section 3.1, with the only exception that we train 200k steps here to fully leverage the large-scale data training. We also incorporate a set of commonly employed text tasks, referred to as TextCore¹, as part of the evaluation to better assess the effects of data mixture. These lead to the following lessons:
Fig. 5: Data Ablations. For each ablation, we present four different metrics: TextCore, 0-shot, 4-shot, and 8-shot. (a) Results with image data where we present five different mixing ratios between interleaved and captioned data. (b) Results with and without text-only data. We mix the text-only data separately with captioned and interleaved data. (c) Results with different mixing ratios between image data (caption and interleaved) and text-only data. (d) Results with and without including VeCap as part of caption data.
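The mixing ratios compared in Figure 5 amount to drawing each training example's source with fixed probabilities. A minimal sketch, assuming a simple per-example weighted sampler; the function name `sample_sources` and the ratios below are placeholders for illustration, not the values used for MM1:

```python
import random
from collections import Counter


def sample_sources(weights, num_samples, seed=0):
    """Draw a data source name for each training example according to `weights`."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_samples)


# Placeholder ratios (hypothetical, chosen only for illustration).
mixture = {"caption": 0.4, "interleaved": 0.4, "text_only": 0.2}
counts = Counter(sample_sources(mixture, num_samples=10_000))
for name in mixture:
    print(f"{name}: {counts[name] / 10_000:.3f}")
```

Changing one weight while renormalizing the others reproduces the kind of sweep shown in panels (a) and (c).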
We collect the results from the previous ablations to determine the final recipe for MM1 multimodal pre-training:
In order to improve the model performance, we scale up the LLM size to 3B, 7B, and 30B parameters. We initialize both the image encoder and the underlying LLM decoder weights for MM1 from in-house pre-trained models². We then perform multimodal pre-training on the above data mix for 200k steps (approx. 400B tokens). All models are pretrained entirely unfrozen with sequence length 4096, up to 16 images per sequence at 378×378 resolution, with a batch size of 512 sequences. All models are trained using the AXLearn framework.³
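The "approx. 400B tokens" figure follows from the stated steps, batch size, and sequence length. A quick check, assuming every sequence is packed to the full 4096 positions (so the result is an upper bound); the function name is hypothetical:

```python
def max_tokens_seen(steps, batch_size, seq_len):
    """Upper bound on tokens processed, assuming every sequence is filled to seq_len."""
    return steps * batch_size * seq_len


total = max_tokens_seen(steps=200_000, batch_size=512, seq_len=4096)
print(f"{total:,} tokens (~{total / 1e9:.0f}B)")  # 419,430,400,000 tokens (~419B)
```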
Fig. 6: Optimal peak learning rate as a function of model size. The data points represent experiments that achieved close-to-optimal 8-shot performance for their associated model size.
2 The LLM is pre-trained on the text-only data mixture mentioned in Sec. 3.3.
3 https://github.com/apple/axlearn
Model Scaling
At this scale, it is infeasible to do proper hyperparameter search. Instead, using established scaling characteristics of LLMs [43, 44, 120, 121], we perform a grid search of learning rate at small scale, \(9M\), \(85M\), \(302M\), and \(1.2B\), while using the components identified in Sec. 3.2⁴ to identify the optimal learning rate and extrapolate it to a larger scale. We use a linear regression in log space to extrapolate from smaller to larger models (see Figure 6), resulting in the following prediction of optimal peak learning rate \(\eta\) given the number of (non-embedding) parameters \(N\):
\[\eta = \exp(-0.4214 \ln(N) - 0.5535) \tag{1}\]
Similar to [48], we found in preliminary experiments that validation loss wasn’t strongly correlated with downstream task performance. Therefore, we directly use downstream 8-shot average performance for curve fitting.
For \(N = 3 \times 10^{10}\), this fit predicts \(\eta = 2.2 \times 10^{-5}\), which is what we use for the final MM1-30B. We initially performed a similar procedure to determine reasonable values for weight decay, denoted by \(\lambda\), but ultimately found that the simple rule of scaling weight decay by peak learning rate as \(\lambda = 0.1\eta\) worked well for all models. All further training details are described in Appendix B.
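The learning-rate rule and the weight-decay rule above are easy to evaluate directly. The sketch below also shows what a linear regression in log space looks like; since the actual grid-search measurements are not given here, the fit is run on synthetic points generated from Eq. (1) itself, purely to show that the procedure recovers the slope and intercept. All function names are hypothetical.

```python
import math


def optimal_peak_lr(num_params):
    """Eq. (1): eta = exp(-0.4214 * ln(N) - 0.5535), N = non-embedding parameters."""
    return math.exp(-0.4214 * math.log(num_params) - 0.5535)


def weight_decay(peak_lr):
    """Rule from the text: lambda = 0.1 * eta."""
    return 0.1 * peak_lr


def fit_log_linear(sizes, lrs):
    """Least-squares fit of ln(lr) = a * ln(N) + b; returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(lr) for lr in lrs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


grid_sizes = [9e6, 85e6, 302e6, 1.2e9]                     # model sizes used in the grid search
synthetic_lrs = [optimal_peak_lr(n) for n in grid_sizes]   # synthetic, NOT measured values
slope, intercept = fit_log_linear(grid_sizes, synthetic_lrs)
eta_30b = optimal_peak_lr(3e10)
print(f"slope={slope:.4f} intercept={intercept:.4f}")
print(f"eta(30B)={eta_30b:.2e} weight_decay={weight_decay(eta_30b):.2e}")
```

Evaluating the fit at N = 3 × 10^10 gives roughly 2.2 × 10^-5, matching the value quoted for MM1-30B.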
Scaling via Mixture-of-Experts (MoE)
MoE scales the total number of model parameters while keeping the activated parameters constant. It enjoys a larger model capacity without sacrificing inference speed significantly. Recently, MoE has shown promising results in language [23, 29, 32, 49, 136], multimodal [70, 87], and computer vision [16, 25, 55, 96] tasks.
In experiments, we further explore scaling the dense model by adding more experts in the FFN layers of the language model. Our MoE implementation generally follows GShard [59] and ST-MoE [136]. Specifically, we design two MoE models: a 3B-MoE using 64 experts that replaces a dense layer with a sparse layer every 2 layers, and a 7B-MoE using 32 experts that replaces a dense layer with a sparse layer every 4 layers. The 3B-MoE contains 64B parameters in total and the 7B-MoE contains 47B parameters in total. We adopt top-2 gating with a load balance loss term with a 0.01 coefficient to encourage a better expert load balance, and a router z-loss term with a 0.001 coefficient to stabilize training. To convert a dense model to MoE, we only replace the dense language decoder with an MoE language decoder. The image encoder and the vision-language connector are kept the same. To train an MoE, we adopt the same training hyperparameters that are discovered for the dense backbone⁵ and identical training settings, including training data and training tokens.
Multimodal Pre-training Results. We evaluate pre-trained models on captioning and VQA tasks via appropriate prompting.⁶ We evaluate zero- and few-shot, as shown in Table 3, and compare against the few approaches that report few-shot pre-training performance. Note that we only compare our model with larger models, e.g., comparing our 30B model with two 80B models.
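As a rough illustration of the top-2 gating and the two auxiliary router losses described for the MoE variants above, the sketch below routes a handful of tokens in plain Python. The exact loss definitions used for MM1 are not given in this text; the forms here (number of experts times the sum of routed fraction times mean router probability, and the mean squared log-sum-exp of the router logits) follow common formulations in the MoE literature and are assumptions, as are all function names. Only the 0.01 and 0.001 coefficients come from the text.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_gate(logits):
    """Pick the two highest-probability experts and renormalize their weights."""
    probs = softmax(logits)
    top2 = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:2]
    norm = sum(probs[e] for e in top2)
    return [(e, probs[e] / norm) for e in top2], probs


def load_balance_loss(batch_logits):
    """num_experts * sum_e f_e * P_e (assumed form); equals 1.0 when P_e is uniform."""
    num_experts = len(batch_logits[0])
    num_tokens = len(batch_logits)
    frac = [0.0] * num_experts       # f_e: share of routing slots sent to expert e
    mean_prob = [0.0] * num_experts  # P_e: mean router probability for expert e
    for logits in batch_logits:
        chosen, probs = top2_gate(logits)
        for e, _ in chosen:
            frac[e] += 1.0 / (2 * num_tokens)
        for e, p in enumerate(probs):
            mean_prob[e] += p / num_tokens
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


def router_z_loss(batch_logits):
    """Mean over tokens of logsumexp(logits)^2 (assumed form); penalizes large logits."""
    total = 0.0
    for logits in batch_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse ** 2
    return total / len(batch_logits)


def auxiliary_loss(batch_logits, balance_coef=0.01, z_coef=0.001):
    return balance_coef * load_balance_loss(batch_logits) + z_coef * router_z_loss(batch_logits)


# Toy router logits for 3 tokens and 4 experts (made-up numbers).
batch = [[0.0, 2.0, 1.0, -1.0], [1.5, 0.2, 0.1, 0.3], [0.0, 0.0, 3.0, 0.5]]
chosen, _ = top2_gate(batch[0])
print(chosen)
print(auxiliary_loss(batch))
```

Because only two experts run per token, the activated parameter count stays fixed while the total grows with the number of experts, which is the property the text relies on.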
4 The only exception is the image encoder, which we downsize to the CLIP (DFN+VeCap) ViT-L with 336×336 resolution to reduce compute costs for the grid searches.
5 The dense backbone is defined to be the dense model we use to construct the MoE model.
6 The models are prompted with “{IMAGE} A photo of” for captioning, and “{IMAGE} Question: {QUESTION} Short answer:” for VQA. See Appendix C.1 for more details on pre-training evaluation.
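The two prompt templates in footnote 6 can be filled in with simple string substitution. A minimal sketch; the `<image>` placeholder string, the helper names, and the newline-joined few-shot layout are assumptions for illustration, since the text only gives the two templates:

```python
CAPTION_TEMPLATE = "{IMAGE} A photo of"
VQA_TEMPLATE = "{IMAGE} Question: {QUESTION} Short answer:"


def caption_prompt(image_token="<image>"):
    """Captioning prompt; the model is expected to continue the sentence."""
    return CAPTION_TEMPLATE.replace("{IMAGE}", image_token)


def vqa_prompt(question, image_token="<image>"):
    """VQA prompt; the model is expected to continue with a short answer."""
    return VQA_TEMPLATE.replace("{IMAGE}", image_token).replace("{QUESTION}", question)


def few_shot_vqa_prompt(examples, question, image_token="<image>"):
    """Prepend solved (question, answer) demonstrations, one per line (assumed layout)."""
    shots = [f"{vqa_prompt(q, image_token)} {a}" for q, a in examples]
    return "\n".join(shots + [vqa_prompt(question, image_token)])


print(caption_prompt())
print(few_shot_vqa_prompt([("What color is the bus?", "red")], "How many dogs are there?"))
```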
Table 3: Multimodal pre-training evaluations. (*) IDEFICS includes PMD in its training data (includes COCO). (†) These models include two text-only demonstrations in their “0” prompt, whereas MM1 does not. For the full table, see Table 6 in Appendix.
When it comes to few-shot performance, MM1 outperforms all published prior work for pre-trained MLLMs. We see superior performance at 30B across captioning benchmarks and the VizWiz-QA benchmark. On VQAv2, TextVQA, and OK-VQA, we are comparable to Emu2 [105] at that scale. For zero-shot performance⁷, even without instruction fine-tuning, our models perform favorably on TextCaps across all model sizes, and are comparable to Flamingo-3B at small scales for most benchmarks.
7 We provide zero-shot results as a reference for the associated few-shot numbers, but we intentionally do not hill-climb on zero-shot metrics as they are mostly indicative of how well the pre-training mixture matches the associated evaluation task format.
In this section, we describe the supervised fine-tuning (SFT) experiments trained on top of the pre-trained models described in the previous sections.
SFT Data Mixture. We follow LLaVA-1.5 [74] and LLaVA-NeXT [75], and collect roughly 1.45M SFT examples from a diverse set of datasets, including:
The academic VL datasets are formatted into the instruction-following format, following LLaVA-1.5 [74]. More details are provided in Appendix A.3. All datasets are mixed together and randomly sampled during training.⁹
During SFT, we keep both the image encoder and the LLM backbone unfrozen; other SFT training details are provided in Appendix B.2. We evaluate our models across 12 benchmarks (see Appendix C.2 for details).
Scaling to Higher Resolutions. Intuitively, higher image resolution leads to better performance. To support high-resolution SFT, we use two approaches:
Positional embedding interpolation, e.g., as explored in Qwen-VL [5] and BLIP-2 [65]. After positional embedding interpolation, the vision transformer backbone is adapted to the new resolution during fine-tuning. Through this method, we have fine-tuned our model to support image resolutions ranging from 448×448, 560×560, to 672×672. Note that, for a resolution of 672×672, with a patch size of 14×14, an image is represented with 2,304 tokens.
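The token counts above and the idea of resizing a positional-embedding grid can be sketched as follows. The token arithmetic follows directly from the stated patch size; the interpolation routine is an assumption (plain bilinear resampling with corner-aligned sampling), since the text does not say which interpolation scheme is used, and both function names are hypothetical.

```python
def num_patch_tokens(resolution, patch_size=14):
    """Number of ViT patch tokens for a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size
    return side * side


def interpolate_pos_embed(grid, new_side):
    """Bilinearly resize an old_side x old_side grid of d-dim embeddings (both sides >= 2)."""
    old_side = len(grid)
    dim = len(grid[0][0])
    scale = (old_side - 1) / (new_side - 1)  # corner-aligned sampling (assumed)
    out = []
    for i in range(new_side):
        y = i * scale
        y0 = min(int(y), old_side - 2)
        ty = y - y0
        row = []
        for j in range(new_side):
            x = j * scale
            x0 = min(int(x), old_side - 2)
            tx = x - x0
            row.append([
                (1 - ty) * ((1 - tx) * grid[y0][x0][k] + tx * grid[y0][x0 + 1][k])
                + ty * ((1 - tx) * grid[y0 + 1][x0][k] + tx * grid[y0 + 1][x0 + 1][k])
                for k in range(dim)
            ])
        out.append(row)
    return out


for res in (448, 560, 672):
    print(res, num_patch_tokens(res))  # 1024, 1600, 2304 tokens respectively

# Toy 4x4 grid whose embedding is simply (row, col), resized to 7x7.
old = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
new = interpolate_pos_embed(old, 7)
print(len(new), len(new[0]), new[3][3])
```

After resizing, the new grid is only an initialization; as the text notes, the backbone is then adapted to the new resolution during fine-tuning.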
Sub-image decomposition, recently introduced by SPHINX [73], Monkey [69], and LLaVA-NeXT [75]. Computing self-attention among more than 2,000 image tokens is computationally challenging, limiting further scaling to even higher image resolutions. Following SPHINX [73], as shown in Figure 7a, for a high-resolution input image, e.g., 1344 × 1344, we construct five images of 672 × 672, and feed them as independent images into our visual encoder.
8 We also experimented with LVIS-Instruct4V [113], but did not observe better performance than using ShareGPT-4V [15], thus it is not included in the final mixture.
9 While some different data mixing strategies were explored, simply mixing these datasets already achieves good performance, similar to observations in Honeybee [12].
Table 4: Comparison with SOTA models on MLLM benchmarks. VQAv2 [38]; VQA^T: TextVQA [104]; SQA^I: ScienceQA-IMG [81]; MMMU [128]; MathV: MathVista [80]; MME^P/C: the Perception/Cognition split of MME [33]; MMB: MMBench [78]; SEED: SEED-Bench [62]; POPE [68]; LLaVA^W: LLaVA-Bench (In-the-Wild) [76]; MM-Vet [127]. The two numbers reported in MMMU denote the performance on the val and test split, respectively. The two numbers reported in SEED denote the performance on the whole SEED-Bench and the image part, respectively. (†) 8-shot prompting: 44.4.
Specifically, we first downsample the input image to 672 × 672 as a high-level representation, and also resize the input image to 1344 × 1344 and divide the resized image into 4 sub-images of 672×672, which preserve more detailed visual information. Using positional embedding interpolation for each sub-image, we can support image resolution as high as 1792×1792 in experiments.
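The geometry of this decomposition can be written down without any image library. A minimal sketch, assuming crop boxes in (left, upper, right, lower) pixel coordinates; the function name and dictionary layout are hypothetical, and the figure of 144 tokens per view is inferred from the 720-token total quoted elsewhere in the text (5 views × 144) rather than stated directly.

```python
def sub_image_views(tile=672, grid=2):
    """Describe the views fed to the encoder: one downsampled overview plus grid*grid crops."""
    full = tile * grid
    # Overview: the whole image resized down to a single tile.
    views = [{"name": "overview", "resize_to": (tile, tile), "crop": None}]
    # Detail views: resize the image to full x full, then cut it into tiles.
    for row in range(grid):
        for col in range(grid):
            box = (col * tile, row * tile, (col + 1) * tile, (row + 1) * tile)
            views.append({"name": f"tile_{row}_{col}", "resize_to": (full, full), "crop": box})
    return views


views = sub_image_views()
TOKENS_PER_VIEW = 144  # assumption inferred from the 720-token total
for view in views:
    print(view["name"], view["resize_to"], view["crop"])
print("image tokens:", len(views) * TOKENS_PER_VIEW)
```

Each view is encoded independently, so self-attention cost in the encoder scales with the tile size rather than with the full input resolution.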
Comparison with SOTA. Results are summarized in Table 4. We use “-Chat” to denote our MM1 models after SFT. First, on average, MM1-3B-Chat and MM1-7B-Chat outperform all listed models of the same size, setting a new state of the art for these model sizes. MM1-3B-Chat and MM1-7B-Chat show particularly strong performance on VQAv2, TextVQA, ScienceQA, and also the more recent benchmarks (MMMU and MathVista).
Second, we explore two MoE models: (i) 3B-MoE with 64 experts, and (ii) 7B-MoE with 32 experts. Our MoE models achieve uniformly better performance than their dense counterparts on almost every benchmark. This shows the great potential of MoE for further scaling, which is left as future work.
Third, for the 30B model size, MM1-30B-Chat outperforms Emu2-Chat-37B [105] and CogVLM-30B [114] on TextVQA, SEED, and MMMU. Compared with the concurrent LLaVA-NeXT [75], we also achieve competitive performance across the board. However, LLaVA-NeXT does not support multi-image reasoning, nor few-shot prompting, as each image is represented as 2,880 tokens sent to the LLM, while ours is only 720 in total. This limits certain applications that involve multiple images.
Impact of Image Resolution. Figure 7b shows the impact of input image resolution on the average performance of the SFT evaluation metrics (details of how the meta-average is calculated are deferred to Appendix C.3). Compared to a baseline model with an image resolution of 336 pixels, we can achieve a 15% relative increase by supporting an image resolution of 1344 × 1344. Note that for the largest image resolution of 1792 × 1792, average performance decreases slightly. This is likely because many of the evaluation images are smaller than this resolution, and resizing artifacts may affect the model performance. By default, the results in Table 4 correspond to image resolutions of 1344×1344.
Impact of Pre-training. In contrast to most recent MLLMs, we perform large-scale pre-training for our models. To assess the impact of pre-training on the final model performance, we perform SFT on the same pre-training run, but at different checkpoint steps. For an earlier checkpoint step, the model has seen fewer unique data samples than a later checkpoint step, so this is a measure of the importance of the quantity of pre-training data. In Figure 7c, we show that the model consistently improves as it has seen more pre-training data. Furthermore, large-scale multimodal pre-training enables strong in-context few-shot learning and multi-image reasoning capabilities, while most MLLM benchmarks shown in Table 4 focus on zero-shot metrics and single-image reasoning.
Few-shot Chain-of-Thought Reasoning after SFT. As seen in Section 3.3, MM1 gains few-shot capabilities thanks to interleaved data. Even though our fine-tuning data includes only single-image examples, we find that MM1-30B-Chat still exhibits multi-image reasoning. This is shown qualitatively in Figure 2, and quantitatively on MathVista [80], where we evaluate few-shot performance with chain-of-thought prompting: 4-shot performance is 41.9, which is 2.5 points higher than zero-shot (39.4).
[Figure labels: Image Crops; Image Encoder (with interpolation); Sequence of encoded images]
Our best performing high-resolution SFT model uses 720 tokens per image. This is a challenge when using more than 4 in-context examples due to the context length. To allow for more examples, we explore a mixed resolution in-context examples formulation, where we feed some of the examples at a lower resolution (see Appendix C.5 for details). Using this formulation with 8 in-context examples increases the performance on MathVista to 44.4.
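The context-length pressure described here can be made concrete with a small token budget calculation. This is only a sketch under stated assumptions: the 720 tokens per high-resolution image comes from the text, but the 4096-token context (borrowed from the pre-training sequence length), the 144 tokens for a low-resolution example, the zero text budget, and the function names are all hypothetical; the actual mixed-resolution scheme is described in Appendix C.5, which is not reproduced here.

```python
def image_tokens(num_high, num_low, high=720, low=144):
    """Total image tokens for a prompt with the given number of high/low-resolution images."""
    return num_high * high + num_low * low


def min_low_res_examples(num_examples, context_len=4096, text_budget=0, high=720, low=144):
    """Fewest in-context examples that must be fed at low resolution for the prompt to fit.

    The query image is always kept at high resolution. Returns None if nothing fits.
    """
    for num_low in range(num_examples + 1):
        num_high = num_examples - num_low + 1  # +1 for the query image
        if image_tokens(num_high, num_low, high, low) + text_budget <= context_len:
            return num_low
    return None


for shots in (4, 5, 8):
    print(shots, "shots ->", min_low_res_examples(shots), "low-res examples needed")
```

Under these assumptions, 4 examples plus the query fit entirely at high resolution, while any additional example forces at least some demonstrations down to the lower resolution, consistent with the limitation described above.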
Do the lessons learned via pre-training transfer to SFT? Yes. We find that (1) pre-training with caption-only data improves SFT metrics, and (2) different VL connector architectures have negligible impact on final results. Detailed ablation results are provided in Appendix C.4.
Qualitative Analysis. To better understand MM1, more qualitative examples are provided in Appendix D, including single-image and multi-image reasoning, and few-shot prompting.
We study how to build performant MLLMs. Through carefully ablating modeling and data choices, we identify important lessons that yield a pre-trained model achieving SOTA results on a range of few-shot evaluations. After SFT, this model family produces competitive performance on a wide range of benchmarks, while enabling multi-image reasoning and few-shot prompting. We hope that the identified lessons will help the community in building strong models beyond any single specific model architecture or data strategy.