Contents
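1 Introduction
Visual language models (VLMs) are powerful tools that can perform a wide range of vision and cross-modal tasks. Such models can cast tasks such as image captioning, visual question answering, and visual grounding as next-token prediction problems. However, training a large language model from scratch is a complex undertaking. Consequently, methods that train a VLM on top of an existing pretrained language model are being explored.
Existing approaches such as BLIP-2 connect a pretrained vision encoder to a language model, but this design limits visual understanding and leaves performance suboptimal. The problem arises because image features do not find a perfect counterpart in the input text space.
CogVLM overcomes this problem by adding a trainable visual expert to every layer, which improves visual understanding while preserving the NLP capabilities of the language model. Image features and text features are mapped into the same input space, which enables deeper fusion.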
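2 Method
2.1 Architecture
The CogVLM architecture consists of four main components.
ViT encoder. A pretrained ViT encoder extracts features from the image, which are then mapped into the same space as the text features of the language model. This step lets the image features match the input distribution of the language model, which supports better fusion.
Visual expert module. In each layer, an independent QKV matrix and MLP process the image features. This structure enables deep fusion between image and text and therefore more accurate vision-language interaction.
2.2 Pretraining
The model is pretrained on image-text pairs. Pretraining starts by minimizing an image captioning loss and then continues with additional pretraining on a visual grounding dataset. This staged procedure helps the model understand visual context better.
2.3 Alignment
The model is optimized by fine-tuning on a variety of multi-modal tasks. This step helps the model behave consistently across diverse visual and linguistic contexts.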
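3 Experiments
3.1 Image Captioning
CogVLM achieves the best performance on several image captioning benchmarks. In particular, it obtains the highest scores on the NoCaps and Flickr datasets, which demonstrates the strong visual understanding ability of the model.
3.2 Visual Question Answering
Evaluated on a variety of VQA benchmarks, CogVLM shows consistently strong performance. This indicates that the model can effectively integrate visual content with natural-language questions.
3.3 Visual Grounding
The model trained on a high-quality visual grounding dataset achieves the best performance on a variety of visual grounding tasks. This suggests that the model can accurately interpret complex interactions between text and images.
3.4 Instruction Following in Real-world User Behavior
When the CogVLM-Chat model is evaluated under real-world user behavior, it outperforms all other publicly available VLMs. This confirms that the model also works effectively in realistic settings.
3.5 Ablation Study
An ablation study was conducted to evaluate how the various components and settings of the model affect performance. The results show that specific settings can have a positive effect on model performance.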
Visual language models (VLMs) are versatile and powerful. Many vision and cross-modality tasks can be formulated as next token prediction, e.g., image captioning (Agrawal et al., 2019), visual question answering (Antol et al., 2015), visual grounding (Yu et al., 2016) and even segmentation (Chen et al., 2022a). Useful abilities like in-context learning (Tsimpoukelli et al., 2021) also emerge along with the improvement of downstream tasks when scaling up VLMs. However, to train a large language model is already non-trivial, and it is more challenging to train a VLM from scratch with the same NLP performance as well-trained pure language models like LLaMA2 (Touvron et al., 2023). Therefore, it is natural to investigate how to train a VLM from an off-the-shelf pretrained language model.
Figure 1: The performance of CogVLM on a broad range of multi-modal tasks compared with existing models.
The popular shallow alignment methods represented by BLIP-2 (Li et al., 2023) connect a frozen pretrained vision encoder and language model via a trainable Q-Former or a linear layer, mapping the image features into the input embedding space of the language model. This method converges fast, but the performance (BLIP-2 NoCaps CIDEr 121.6) is not as good as jointly training the vision and language modules, e.g., PaLI-X (NoCaps CIDEr 126.3). For chat-style VLMs trained by shallow alignment methods, e.g., MiniGPT-4 (Zhu et al., 2023), LLaVA (Liu et al., 2023b), and VisualGLM (Appendix D), the weak visual understanding ability manifests as hallucination. So, is it possible to retain the NLP capabilities of the large language model while adding top-notch visual understanding abilities to it?
CogVLM gives a “yes” answer. In our opinion, the root cause of the inferior performance of shallow alignment methods lies in the lack of deep fusion between vision and language information. This inspiration arises from the comparison between p-tuning (Liu et al., 2023e) and LoRA (Hu et al., 2021) in efficient fine-tuning, where p-tuning learns a task prefix embedding in the input while LoRA adapts the model weights in each layer via a low-rank matrix. As a result, LoRA performs better and is more stable. A similar phenomenon might also exist in VLMs, because in the shallow alignment methods, the image features act like the prefix embedding in p-tuning. More detailed reasons for the performance degradation of p-tuning and shallow alignment include:
A possible solution is to adapt the language model to the image-text joint training, which is adopted by PaLI (Chen et al., 2022b) and Qwen-VL (Bai et al., 2023a). However, in this way, the NLP ability is inevitably impaired, which might affect text-centered tasks, such as image-based poetry creation or introducing the background story of images. According to PaLM-E (Driess et al., 2023), making the language model trainable during VLM pretraining leads to catastrophic forgetting and an 87.3% drop in NLG performance for the 8B language model.
CogVLM instead adds a trainable visual expert to the language model. In each layer, the image features in the sequence use a QKV matrix and MLP layer separate from those used by the text features. The visual expert doubles the number of parameters while keeping the FLOPs the same. Since all the parameters in the original language model are fixed, the behavior is the same as that of the original language model if the input sequence contains no image. Our CogVLM-17B trained from Vicuna-7B achieves state-of-the-art or the second-best performance on 14 classic cross-modal benchmarks, including 1) image captioning datasets: NoCaps, Flickr30k, COCO, 2) VQA datasets: VQAv2, OKVQA, GQA, TextVQA, VizWiz, 3) visual grounding datasets: RefCOCO, RefCOCO+, RefCOCOg, Visual7W, 4) multiple choice datasets: TDIUC, ScienceQA. We also trained a CogVLM-28B-zh from ChatGLM-12B (Du et al., 2021) to support both English and Chinese for commercial use, which is not included in this paper.
Since most well-known previous VLMs are closed-source, including Flamingo (Alayrac et al., 2022), SimVLM (Wang et al., 2021), CoCa (Yu et al., 2022), BEIT-3 (1.9B) (Wang et al., 2022c), GIT2 (Wang et al., 2022a), PaLI (Chen et al., 2022b), and PaLI-X (Chen et al., 2023b), we anticipate that the open-sourcing of CogVLM will greatly help the research and industrial application of visual understanding.
Figure 3: The architecture of CogVLM. (a) The illustration about the input, where an image is processed by a pretrained ViT and mapped into the same space as the text features. (b) The Transformer block in the language model. The image features have a different QKV matrix and FFN. Only the purple parts are trainable.
The CogVLM model comprises four fundamental components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT), and a visual expert module. Figure 3 shows an overview of the CogVLM architecture. The components’ design and implementation details are provided below:
ViT encoder. We utilize pretrained EVA2-CLIP-E (Sun et al., 2023) in CogVLM-17B. The final layer of ViT encoder is removed because it specializes in aggregating the [CLS] features for contrastive learning.
MLP adapter. The MLP adapter is a two-layer MLP (SwiGLU (Shazeer, 2020)) to map the output of ViT into the same space as the text features from word embedding. All image features share the same position id in the language model.
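As a rough illustration of the adapter described above, the following sketch applies a SwiGLU-style MLP to a set of ViT patch features. The three-matrix gate/up/down layout, the toy dimensions, and the names `swiglu_adapter`, `w_gate`, `w_up`, and `w_down` are assumptions made for this example; the text only states that the adapter is a two-layer SwiGLU MLP.

```python
import numpy as np

def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_adapter(vit_features, w_gate, w_up, w_down):
    """Map ViT patch features into the text embedding space.

    vit_features: (num_patches, vit_dim)
    w_gate, w_up: (vit_dim, inner_dim)
    w_down:       (inner_dim, llm_dim)
    """
    # SwiGLU: a SiLU-activated gate multiplies a linear projection elementwise.
    gated = silu(vit_features @ w_gate) * (vit_features @ w_up)
    # Project the gated activations to the language model's hidden size.
    return gated @ w_down

# Toy sizes, chosen only for illustration (not the real model dimensions).
rng = np.random.default_rng(0)
num_patches, vit_dim, inner_dim, llm_dim = 16, 32, 64, 48
vit_features = rng.standard_normal((num_patches, vit_dim))
w_gate = rng.standard_normal((vit_dim, inner_dim)) * 0.02
w_up = rng.standard_normal((vit_dim, inner_dim)) * 0.02
w_down = rng.standard_normal((inner_dim, llm_dim)) * 0.02

image_embeddings = swiglu_adapter(vit_features, w_gate, w_up, w_down)
print(image_embeddings.shape)
```

Each row of `image_embeddings` would then be placed in the input sequence alongside the word embeddings of the text tokens.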
Pretrained large language model. CogVLM’s model design is compatible with any off-the-shelf GPT-style pretrained large language model. Specifically, CogVLM-17B adopts Vicuna-7B v1.5 (Chiang et al., 2023) for further training. A causal mask is applied to all the attention operations, including the attention between image features.
Visual expert module. We add a visual expert module to each layer to enable deep visual-language feature alignment. Specifically, the visual expert module in each layer consists of a QKV matrix and an MLP. The shapes of the QKV matrix and MLP are identical to those in the pretrained language model, and both are initialized from them. The motivation is that each attention head in the language model captures a certain aspect of semantic information, while a trainable visual expert can transform the image features to align with the different heads, thereby enabling deep fusion. Formally, suppose that the input hidden states of an attention layer are $X \in \mathbb{R}^{B \times H \times (L_I + L_T) \times D}$, where $B$ is the batch size, $L_I$ and $L_T$ are the lengths of the image and text sequences, $H$ is the number of attention heads, and $D$ is the hidden size. In the attention with visual expert, $X$ is first split into image hidden states $X_I$ and text hidden states $X_T$, and the attention is computed as:
$$\mathrm{Attention}(X, W_I, W_T) = \mathrm{softmax}\left(\frac{\mathrm{Tril}(QK^{\top})}{\sqrt{D}}\right)V,$$
$$Q = \mathrm{concat}(X_I W_I^Q, X_T W_T^Q), \quad K = \mathrm{concat}(X_I W_I^K, X_T W_T^K), \quad V = \mathrm{concat}(X_I W_I^V, X_T W_T^V),$$
where $W_I$, $W_T$ are the QKV matrices of the visual expert and original language model, and $\mathrm{Tril}(\cdot)$ means lower-triangular mask. The visual expert in FFN layers performs similarly:
$$\mathrm{FFN}(X) = \mathrm{concat}(\mathrm{FFN}_I(X_I), \mathrm{FFN}_T(X_T)),$$
where $\mathrm{FFN}_I$ and $\mathrm{FFN}_T$ are the FFN of the visual expert and original language model.
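The routing idea behind the visual expert can be sketched in a few lines of NumPy. This is a minimal single-head, single-example illustration under stated assumptions, not the released implementation: image tokens are assumed to come first in the sequence, the batch and head dimensions are dropped, and the function name `visual_expert_attention` and the dictionary layout of the weights are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def visual_expert_attention(x, num_image_tokens, w_image, w_text):
    """Single-head causal attention with separate QKV weights per modality.

    x:                (seq_len, dim), image tokens first, then text tokens
    w_image, w_text:  dicts with keys 'q', 'k', 'v', each of shape (dim, dim)
    """
    x_image, x_text = x[:num_image_tokens], x[num_image_tokens:]
    # Image rows go through the visual expert, text rows through the
    # original language model weights; results are concatenated back.
    q, k, v = (
        np.concatenate([x_image @ w_image[name], x_text @ w_text[name]], axis=0)
        for name in "qkv"
    )
    scores = q @ k.T / np.sqrt(x.shape[-1])
    # Lower-triangular (causal) mask over the whole sequence, image part included.
    causal = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
dim, num_image_tokens, num_text_tokens = 8, 4, 3
x = rng.standard_normal((num_image_tokens + num_text_tokens, dim))
w_text = {name: rng.standard_normal((dim, dim)) for name in "qkv"}
# Per the text, the visual expert is initialized from the language model weights.
w_image = {name: w.copy() for name, w in w_text.items()}

out = visual_expert_attention(x, num_image_tokens, w_image, w_text)
print(out.shape)
```

With `num_image_tokens = 0` every row is routed through `w_text`, which mirrors the statement that the model behaves like the original language model when the input contains no image. Each token is multiplied by exactly one set of weights, so the per-token compute matches a model without the expert even though the parameter count grows.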
Data. The image-text pairs for pretraining are all publicly available, including LAION-2B and COYO-700M. After removing the broken URLs, NSFW images, images with noisy captions, images with political bias and images with an aspect ratio > 6 or < 1/6, about 1.5B images are left for pretraining.
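The aspect-ratio criterion above is simple enough to state as code. The sketch below shows only that one filter; the function name `has_acceptable_aspect_ratio` and the handling of non-positive sizes are assumptions for illustration, and the other filters (broken URLs, NSFW content, noisy captions, political bias) are not modeled.

```python
def has_acceptable_aspect_ratio(width, height, max_ratio=6.0):
    """Return True if width/height lies within [1/max_ratio, max_ratio]."""
    if width <= 0 or height <= 0:
        # Degenerate sizes are rejected (an assumption of this sketch).
        return False
    ratio = width / height
    return 1.0 / max_ratio <= ratio <= max_ratio

# Hypothetical (width, height) pairs.
sizes = [(640, 480), (1400, 200), (100, 700), (500, 100)]
kept = [size for size in sizes if has_acceptable_aspect_ratio(*size)]
print(kept)
```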
We also crafted a visual grounding dataset of 40M images. Each noun in the image caption is associated with bounding boxes to indicate the positions in the image. The construction process basically follows Peng et al., which extracts nouns via spaCy (Honnibal & Johnson, 2015) and predicts the bounding boxes using GLIPv2 (Zhang et al., 2022). The image-text pairs are sampled from LAION-115M, a subset of LAION-400M filtered by Li et al. (2023). We filter and retain a subset of 40 million images to ensure that over 75% of images contain at least two bounding boxes.
Training. The first stage of pretraining is for image captioning loss, i.e. next token prediction in the text part. We train the CogVLM-17B model on the 1.5B image-text pairs introduced above for 120,000 iterations with a batch size of 8,192. The second stage of pretraining is a mixture of image captioning and Referring Expression Comprehension (REC). REC is a task to predict the bounding box in the image given the text description of an object, which is trained in the form of VQA, i.e., “Question: Where is the object?” and “Answer: [[x0, y0, x1, y1]]”. Both x and y coordinates range from 000 to 999, meaning the normalized position in the image. We only consider the loss of the next token prediction in the “Answer” part. We pretrain the second stage for 60,000 iterations with a batch size of 1,024 on the text-image pairs and visual grounding datasets introduced above. During the final 30,000 iterations, we change the input resolution from 224 × 224 to 490 × 490. The total number of trainable parameters is 6.5B and the pretraining consumes about 4,096 A100×days.
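To make the REC answer format concrete, the sketch below converts a pixel-space bounding box into the three-digit normalized form described above. The exact rounding rule is not given in the text, so the floor-and-clamp bucketing used here, as well as the function names `to_bucket` and `format_rec_answer`, are assumptions.

```python
def to_bucket(value, size):
    """Map a pixel coordinate in [0, size] to an integer bucket in [0, 999]."""
    bucket = int(value * 1000 // size)
    # Clamp so that a coordinate on the far edge still fits in three digits.
    return min(999, max(0, bucket))

def format_rec_answer(box, image_width, image_height):
    """Format (x0, y0, x1, y1) in pixels as '[[x0, y0, x1, y1]]' with 000-999 values."""
    x0, y0, x1, y1 = box
    coords = [
        to_bucket(x0, image_width),
        to_bucket(y0, image_height),
        to_bucket(x1, image_width),
        to_bucket(y1, image_height),
    ]
    return "[[" + ", ".join(f"{c:03d}" for c in coords) + "]]"

# A hypothetical box on a 490 x 490 input image.
answer = format_rec_answer((0, 49, 245, 490), image_width=490, image_height=490)
print("Question: Where is the object?")
print("Answer:", answer)
```

Because the coordinates are written as plain text tokens, the same next-token prediction loss used for captioning applies to the answer string without any change to the model head.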
We further finetune CogVLM on a broad range of tasks, so as to align CogVLM with free-form instructions of any topic. We name the finetuned model CogVLM-Chat. As the examples in Figure 2 and Appendix show, CogVLM-Chat can successfully align with diverse instructions, thus enabling flexible interaction with humans.
Data. The high-quality data for supervised fine-tuning (SFT) is collected from LLaVA-Instruct (Liu et al., 2023b), LRV-Instruction (Liu et al., 2023a), LLaVAR (Zhang et al., 2023) and an in-house dataset, with a total of about 500,000 VQA pairs. The quality of SFT data is of vital importance, but LLaVA-Instruct is generated by a pipeline involving language-only GPT-4, so errors are inevitable. We therefore corrected the errors in the LLaVA-Instruct dataset via manual inspection and annotation.
SFT. For supervised fine-tuning, we train for 8,000 iterations with a batch size of 640, a learning rate of $10^{-5}$ and 50 warm-up iterations.
In order to prevent overfitting to the text answers of the dataset, we use a smaller learning rate (10% of the learning rate of the other parameters) to update the pretrained language model. All the parameters except the ViT encoder are trainable during SFT.
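The SFT hyperparameters above can be summarized as a small learning-rate rule. The sketch below is an illustration only: the text does not state the warm-up shape or the schedule after warm-up, so linear warm-up followed by a constant rate is an assumption, and the names `learning_rate` and `is_pretrained_lm_param` are hypothetical.

```python
BASE_LR = 1e-5        # learning rate stated for SFT
WARMUP_ITERS = 50     # warm-up iterations stated for SFT
LM_LR_SCALE = 0.1     # pretrained language model weights use 10% of the base rate
SFT_ITERS = 8_000
BATCH_SIZE = 640

def learning_rate(step, is_pretrained_lm_param):
    """Learning rate at a 0-indexed step for one parameter group.

    Assumes linear warm-up and a constant rate afterwards.
    """
    warmup = min(1.0, (step + 1) / WARMUP_ITERS)
    scale = LM_LR_SCALE if is_pretrained_lm_param else 1.0
    return BASE_LR * warmup * scale

# Total number of training samples drawn over the whole SFT run.
samples_seen = SFT_ITERS * BATCH_SIZE

print(learning_rate(0, False), learning_rate(49, False), learning_rate(49, True))
print(samples_seen)
```

With about 500,000 VQA pairs, 8,000 iterations at batch size 640 draw 5.12M samples, i.e. roughly ten passes over the SFT data.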
To rigorously validate the superior performance and robust generalization of our base model, we conduct quantitative evaluations on an array of multi-modal benchmarks. These benchmarks can be categorized into three broad areas covering a comprehensive range of measurement:
We evaluate the image captioning capability of our pretrained base model on the aforementioned four benchmarks. In a zero-shot evaluation on the NoCaps and Flickr datasets, we assess the precision of our model in describing long-tail visual concepts. Additionally, we present results from fine-tuning on the COCO and TextCaps datasets.
The detailed performance is shown in Table 1. Overall, our model achieves SOTA or comparable performance across the board. Specifically, on the NoCaps benchmark, our base model outperforms the previous best method, GIT2, across four splits, by up to 5.7 points on the out-of-domain set, while consuming only 10% of the pretraining data (1.5B vs 12.9B). On the Flickr benchmark, our model achieves a SOTA score of 94.9, surpassing the concurrently released Qwen-VL model by 9.1 points. These results demonstrate the remarkable capability and robustness of our pretrained model on the image captioning task. We also evaluate on COCO (Lin et al., 2014) and TextCaps, where the latter is specifically designed to integrate the textual information of the given image into captions. Though trained without dedicated OCR data, our base model encouragingly shows significant text-reading ability: it obtains performance competitive with PaLI-X-55B and outperforms the previous best model of the same scale, PaLI-17B, by 9.1 points.
Visual Question Answering is a task of validating general multi-modal capabilities of models, which requires a mastery of skills including vision-language understanding and commonsense reasoning. We evaluate our model on 7 VQA benchmarks: VQAv2, OKVQA, GQA, VizWiz-QA, OCRVQA, TextVQA, ScienceQA, covering a wide range of visual scenes. We train our base model on the training sets and evaluate it on the publicly available val/test sets for all benchmarks, where both procedures adopt the open-vocabulary generation settings without OCR pipeline input.
As shown in Table 2, our model achieves state-of-the-art performance on 6 of 7 benchmarks compared with models of similar scales, such as PaLI-17B and Qwen-VL. Our model even surpasses models of much larger scale on multiple benchmarks, such as PaLI-X-55B on VizWiz-QA (test-std +5.1, test-dev +3.8), PaLM-E-84B on VQAv2 (test-dev +4.2) and OKVQA (+1.4), Flamingo-80B on VQAv2 (test-dev +2.7, test-std +2.6), VizWiz-QA (test-dev +10.7, test-std +10.4) and TextVQA (+15.6). Our model also achieves the best score of 92.71 on the multi-modal split (i.e., IMG) of ScienceQA (Lu et al., 2022b), setting a new SOTA. These results suggest that our base model can serve as a strong multi-modal backbone capable of solving various visual question answering tasks.
A detailed summary of all benchmarks and corresponding metrics is available in Appendix A.2.
Table 1: Performance on Image Captioning benchmarks, where all tasks use CIDEr as the evaluation metric. OOD refers to out-of-domain test set. Karp. refers to the Karpathy test split.
Generalist performance. In order to compare fairly with Unified-IO (Lu et al., 2022a), Qwen-VL (Bai et al., 2023a), mPLUG-DocOwl (Ye et al., 2023) and other models trained in a generalist paradigm across multi-modal tasks, we further trained a unified model using data composed of dozens of multi-modal datasets and used a single consistent checkpoint for evaluation. The datasets encompass 14 QA datasets, such as VQAv2, OKVQA, and TextVQA, as well as caption datasets including COCO caption, TextCaps, and those used during the pretraining phase. Experimental results show that multitask learning does not significantly reduce the model’s performance on individual tasks, and CogVLM remains leading in performance across all tasks.
Table 3: Generalist performance on Image Captioning and VQA benchmarks.
In order to endow our model with consistent, interactive visual grounding capabilities, we collect a high-quality dataset covering 4 types of grounding data:
After the second pretraining stage using our 40M visual grounding dataset, we continue to train our model on this high-quality dataset, resulting in a generalist grounding-enhanced model, CogVLM-Grounding. It is noteworthy that the curated datasets cover a versatile range of visual grounding capabilities, and many datasets can be adapted and repurposed across different tasks. For instance, grounded captioning datasets can be reformulated to suit REG and REC tasks. Taking the example of “A man [box1] and a woman [box2] are walking together.”, this can be reframed into question answering pairs like (“Describe this region [box2].”, “A woman.”) and (“Where is the man?”, “[box1]”). Similarly, REC datasets can be translated into REG tasks by switching the input and output, and vice versa. However, certain conversions might lead to ambiguities. For example, when presented with the isolated query “Where is another man?” from the caption “A man [box1] is running, while another man [box2] is looking.”, the distinction between [box1] and [box2] becomes unclear, potentially leading to errors.
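The reformulation of grounded captions into REG and REC pairs can be sketched as follows. This is a hypothetical illustration: the region annotations are assumed to be already parsed into dictionaries with `phrase`, `noun`, and `box` fields, and skipping the REC question when a noun occurs more than once is only one possible way to avoid the ambiguity described above, not a rule stated in the text.

```python
from collections import Counter

def caption_to_qa_pairs(regions):
    """Turn grounded caption regions into (question, answer) pairs.

    regions: list of dicts with keys 'phrase' (e.g. 'A man'),
             'noun' (e.g. 'man') and 'box' (e.g. '[box1]').
    """
    noun_counts = Counter(region["noun"] for region in regions)
    pairs = []
    for region in regions:
        # REG: given a box, describe the region.
        pairs.append((f"Describe this region {region['box']}.", f"{region['phrase']}."))
        # REC: given a description, locate the box. Skipped when the noun
        # is repeated, since the question would not identify a unique box.
        if noun_counts[region["noun"]] == 1:
            pairs.append((f"Where is the {region['noun']}?", region["box"]))
    return pairs

# Regions for "A man [box1] and a woman [box2] are walking together."
walking = [
    {"phrase": "A man", "noun": "man", "box": "[box1]"},
    {"phrase": "A woman", "noun": "woman", "box": "[box2]"},
]
for question, answer in caption_to_qa_pairs(walking):
    print(question, "->", answer)
```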
Table 4 shows the results on the standard visual grounding benchmarks. We find that our generalist model achieves state-of-the-art performance across the board, with a significant advantage over previous or concurrent models. Moreover, we also evaluate the specialist performance of our model finetuned on each individual benchmark training set, for a fair comparison with the best models dedicated to each task. As shown in the bottom part of Table 4, our model achieves SOTA performance on 5 of 9 splits and comparable results on the other subsets. These results suggest a remarkable visual grounding capability of our model incorporating our training paradigm.
Table 4: Results on Referring Expression Comprehension and Grounded Visual Question Answering.
To evaluate the CogVLM-Chat model’s capacity under real-world user behavior, we further employ TouchStone (Bai et al., 2023b), an extensive benchmark for multimodal language models. Table 5 shows the GPT-4 (OpenAI, 2023) similarity scores between the generated and standard answers, suggesting that CogVLM-Chat significantly outperforms all the other publicly available VLMs.
Table 5: Evaluation results on TouchStone in English.
To understand the impact of various components and settings on our model’s performance, we conduct an extensive ablation study for 6,000 iterations with a batch size of 8,192. Table 6 summarizes the results on the following aspects:
Model structure and tuned parameters. We investigate the effectiveness of tuning only the MLP Adapter layer or tuning all LLM parameters and the Adapter without adding VE, as well as modifying the VE architecture to add full VE at every 4th LLM layer or only the FFN-equipped VE at all layers. From the results we can see that only tuning the adapter layer (e.g., BLIP-2) may result in a shallow alignment with significantly inferior performance, and that decreasing either the number of VE layers or the VE parameters at each LLM layer causes a prominent degradation.
Initialization Method. We investigate the effectiveness of initializing VE weights from the LLM; the slight decrease in performance without this initialization suggests that the method has a positive impact.
Visual Attention Mask. We empirically find that using a causal mask on visual tokens will yield a better result in comparison with a full mask. We hypothesize the possible explanation for this phenomenon is that the causal mask better fits the inherent structure of LLM.
Image SSL Loss. We also investigated the self-supervised learning loss on image features, where each visual feature predicts the CLIP feature of the next position for visual self-supervision. In line with the observation from PaLI-X (Chen et al., 2023b), we find it brings no improvement on downstream tasks, although we did observe improvements in small models in our early experiments.
We utilize EMA (Exponential Moving Average) during pretraining, which often brings improvements across various tasks.
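For readers unfamiliar with EMA, the update keeps a smoothed copy of the weights alongside the trained ones. The sketch below shows the standard update rule on plain Python dictionaries; the decay value and the name `ema_update` are assumptions, since the text does not report the setting used.

```python
def ema_update(ema_params, params, decay=0.999):
    """Update the EMA copy in place: ema = decay * ema + (1 - decay) * param."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params

# Toy example with a single scalar weight and a hypothetical decay of 0.5.
ema_params = {"w": 1.0}
for current_weight in [0.0, 0.0, 0.0]:
    ema_update(ema_params, {"w": current_weight}, decay=0.5)
print(ema_params["w"])
```

The averaged weights, rather than the raw ones, are then typically the ones used for evaluation.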
In this paper, we introduce CogVLM, an open visual language foundation model. CogVLM shifts the paradigm for VLM training from shallow alignment to deep fusion, achieving state-of-the-art performance on 10 classic multi-modal benchmarks.
The VLM training is still in its infancy, and there are many directions to explore, for example, better SFT alignment, RLHF and anti-hallucination. Since the previous famous VLMs are mostly closed-source, we believe CogVLM will be a solid foundation for future multi-modal research.