
Model | Kosmos-G

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-10-17

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

  • url: https://arxiv.org/abs/2310.02992
  • pdf: https://arxiv.org/pdf/2310.02992
  • abstract: Recent advancements in text-to-image (T2I) and vision-language-to-image (VL2I) generation have made significant strides. However, the generation from generalized vision-language inputs, especially involving multiple images, remains under-explored. This paper presents Kosmos-G, a model that leverages the advanced perception capabilities of Multimodal Large Language Models (MLLMs) to tackle the aforementioned challenge. Our approach aligns the output space of MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates a unique capability of zero-shot multi-entity subject-driven generation. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows for a seamless substitution of CLIP and effortless integration with a myriad of U-Net techniques ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of “image as a foreign language in image generation.”

Contents

TL;DR


  1. KOSMOS-G performs image generation by exploiting the semantic alignment between an MLLM and CLIP.
  2. It proposes an instruction tuning method that enables fine-grained image generation following diverse instructions.
  3. It is the first model with zero-shot multi-entity subject-driven generation capability.

1. Introduction

Recent text-to-image (T2I) generation has made great strides in producing realistic and accurate images from textual descriptions. Building on this success, a number of studies have explored more sophisticated vision-language-to-image (VL2I) generation. In particular, models such as Re-Imagen, Prompt Diffusion, and SuTI attempt VL2I tasks by injecting image features into the U-Net of diffusion models. However, this approach limits the effectiveness of unified modeling of text and images and is hard to extend to multi-entity scenarios.

To overcome these limitations, the authors propose KOSMOS-G, which leverages the capabilities of an MLLM to treat "images as a foreign language." By aligning the MLLM with the image decoder, KOSMOS-G gains the ability to faithfully reproduce images according to instructions. The paper discusses KOSMOS-G's design, training procedure, and performance evaluation on various benchmarks in detail.


2. Method

2.1 Multimodal Language Modeling

KOSMOS-G represents its input as a single sequence using start (<s>) and end (</s>) of sequence tokens together with image start (<image>) and end (</image>) tokens. Text tokens and images are encoded into vectors and fed into a Transformer-based decoder.

Mathematically, KOSMOS-G is trained with a next-token prediction task. The training objective is to maximize the log-likelihood of the tokens:

\[L = \sum_{i=1}^{N} \log P(w_i | w_1, ..., w_{i-1}; \Theta)\]

where $w_i$ is the $i$-th token in the sequence, $N$ is the sequence length, and $\Theta$ denotes the model parameters.
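
For concreteness, here is a minimal PyTorch sketch of this next-token objective over an interleaved sequence; the tensor shapes and the text mask (only discrete text tokens contribute to the loss, as the paper notes later) are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, targets: torch.Tensor, text_mask: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the next token, computed only on discrete text positions.

    logits:    (B, L, V)  decoder outputs
    targets:   (B, L)     token ids shifted by one (the "next" tokens)
    text_mask: (B, L)     1.0 where the target is a text token, 0.0 for image-embedding positions
    """
    log_probs = F.log_softmax(logits, dim=-1)                        # log P(w_i | w_<i)
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Maximizing the sum of log-likelihoods == minimizing their negative mean over text tokens.
    return -(token_ll * text_mask).sum() / text_mask.sum().clamp(min=1)
```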

2.2 Image Decoder Aligning

To improve image generation capability, the output space of KOSMOS-G is aligned with that of the U-Net image decoder. Text is used as an anchor so that image inputs also become compatible with the image decoder.

Mathematically, this process can be described as:

\[\text{AlignerNet}: \min_{\Theta_M, \Theta_N} \|M(s; \Theta_M) - t\|^2 + \|N(M(s; \Theta_M); \Theta_N) - s\|^2\]

where $s$ is the source embedding, $t$ is the target embedding, $M$ and $N$ are the encoder and decoder functions, and $\Theta_M$ and $\Theta_N$ are the learned parameters.
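
A minimal PyTorch sketch of this combined alignment-plus-reconstruction objective, assuming `M` and `N` are arbitrary modules and that both terms are weighted equally (the actual weighting and reduction are assumptions):

```python
import torch
import torch.nn.functional as F

def aligner_loss(M, N, s: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Alignment + reconstruction objective for AlignerNet.

    s: source embeddings from KOSMOS-G            (B, l_s, d)
    t: target embeddings from the CLIP text encoder (B, l_t, d)
    M: encoder mapping the source space onto the CLIP target space
    N: decoder reconstructing the source from M(s)
    """
    m_s = M(s)
    align = F.mse_loss(m_s, t)        # ||M(s) - t||^2
    recon = F.mse_loss(N(m_s), s)     # ||N(M(s)) - s||^2, keeps the features discriminative
    return align + recon
```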

2.3 Instruction Tuning

KOSMOS-G is fine-tuned on a task of generating images that follow composed instructions. The input is a combination of text and images, and the model learns to generate the corresponding image.

Mathematically, the goal is to minimize the following loss:

\[L = -\text{ELBO} = -\mathbb{E}_{q(z|x)}[\log p_{\theta}(x|z)] + \text{KL}(q(z|x) \| p(z))\]

where $x$ is the real image, $z$ is the latent representation, $q$ is the encoding distribution, $p$ is the decoding distribution, and $\theta$ denotes the model parameters.


3. Model Training

KOSMOS-G is trained on web-scale multimodal corpora. In the instruction tuning stage, the model is fine-tuned using image-caption pairs and image editing data.


4. Evaluation

KOSMOS-G delivers impressive zero-shot generation results across diverse settings, with particularly strong performance on multi-entity VL2I tasks. This suggests that KOSMOS-G can generate images that closely follow the given instructions.


1 Introduction

In recent studies, advancements in text-to-image (T2I) generation, particularly with diffusion models, have shown remarkable progress in producing highly photorealistic, accurate, and varied images from textual descriptions. Building on the unprecedented success of producing highly accurate images from text descriptions, numerous studies have delved into more sophisticated vision-language-to-image (VL2I) generation techniques. Methods such as DreamBooth [RLJ+22] and SuTI [CHL+23] emphasize subject-driven generation, where they use both subject images and textual descriptions as inputs to render the subject in a newly described context. On the other hand, image editing models like InstructPix2Pix [BHE23] accept original images and editing instructions to produce modified images as outputs. However, how to generate images from generalized vision-language inputs remains under-explored.

Many studies have been undertaken to accomplish this objective. Notably, Re-Imagen [CHSC22], Prompt Diffusion [WJL+23], and SuTI [CHL+23] inject image features into the U-Net of diffusion models. These models integrate images and textual guidance to address specific VL2I tasks. Specifically, Re-Imagen focuses on retrieve-augmented image generation, Prompt Diffusion emphasizes subject-driven generation, and SuTI specializes in in-context generation. However, such injection methods segregate the guidance for text and images, thereby limiting the effectiveness of joint modeling between the two modalities. Additionally, this approach is challenging to extend to scenarios involving multiple entities.

Multimodal Large Language Models (MLLMs) [ADL+22, HSD+22, AHR+22, HDW+23, LLSH23] have significantly expanded the capabilities of language models, allowing them to process diverse modalities such as images. This multimodal perception empowers LLMs to undertake tasks previously deemed impossible, including document intelligence and understanding graphical user interfaces. Recent research has utilized MLLMs for Vision-Language-to-Image (VL2I) tasks. This approach presents several advantages: 1) It capitalizes on the inherent vision-language alignment within the MLLM. 2) The MLLM architecture naturally supports interleaved vision-language input, accommodating multiple images. One of the pioneering works in this domain is M-VADER [WBE+22], which achieves semantic alignment between the MLLM and the diffusion image decoder by training on image-caption pairs. GILL [KFS23], Emu [SYC+23], and DreamLLM [DHP+23] focus on interleaved vision-language generation. They effectively align the output space of the MLLM with the diffusion image decoder through CLIP supervision or pre-training on multimodal corpora. However, this alignment predominantly remains at the semantic level, meaning these methods may not be good at detailed, subject-driven image generation. BLIP-Diffusion [LLH23] learns object representations by synthesizing images through the composition of subjects with random backgrounds. This approach effectively endows it with a zero-shot, subject-driven text-to-image generation capability. However, the specific design of its input template and training data restricts its scalability to multiple entities.

To support generalized vision-language inputs across multiple entities, we present KOSMOS-G, which leverages the property of MLLMs following an "align before instruct" manner. Specifically, we start from the multimodal language modeling stage, leading to the KOSMOS-1 [HDW+23] MLLM. It envisions language models as a universal task layer, perceiving free-form interleaved vision-language inputs and consolidating various task predictions into textual formats. Given the aligned vision-language representation, we then use the language modality as an anchor and align the output space of the MLLM with the CLIP text encoder. Finally, we perform instruction tuning on the curated data. KOSMOS-G accepts captions as input, where each entity is followed by its segmented image. The model is trained to faithfully reproduce all entities, render the text content, and follow the instructions. In this process, the frozen pre-trained diffusion image decoder serves as a score metric. We distill the learned data distribution to pass the differentiable gradient to the MLLM. This enables KOSMOS-G to harness rich features from the image encoder to generate images faithfully reproducing the contents across various contexts (see Figure 1).

Figure 2: KOSMOS-G comprises an MLLM for multimodal perception, coupled with an AlignerNet that bridges the MLLM to the diffusion U-Net image decoder. KOSMOS-G can pass the fine concept-level guidance from interleaved input to the image decoder, and offers a seamless alternative to CLIP. Orange denotes the trainable modules; blue denotes the frozen ones.

Benefiting from general-purpose pre-training, KOSMOS-G approaches the objective of "image as a foreign language in image generation." This means KOSMOS-G can capture novel concepts from input images and guide personalized creations in a zero-shot setting. Notably, KOSMOS-G also stands as the first model to master zero-shot multi-entity subject-driven generation. Owing to the score distillation instruction tuning, KOSMOS-G does not need to modify any parameters of the image decoder, i.e., the diffusion U-Net and VAEs. This makes it possible to seamlessly substitute CLIP with KOSMOS-G in any image generation system. As a result, a plethora of applications can be unlocked in conjunction with U-Net techniques, ranging from fine-grained controls like ControlNet [ZA23] to personalized or stylized image decoder variants such as community-contributed LoRA [HSW+22] checkpoints.

Overall, we propose KOSMOS-G as an initial attempt towards the objective of “image as a foreign language in image generation.” We summarize our main contributions as follows:

  1. We align the output space of the MLLM with CLIP using the text modality as an anchor, efficiently leveraging the multimodal perception of MLLMs for image generation.
  2. We propose a compositional instruction tuning task, leading to a strong zero-shot multi-entity subject-driven generation capability.
  3. Score distillation instruction tuning allows KOSMOS-G to seamlessly interface with a spectrum of U-Net techniques, indicating broad applicability and potential for integration into various frameworks.

As shown in Figure 2, KOSMOS-G is a model that can perceive general modalities, follow instructions, and generate images conditioned on the multimodal input. Specifically, the backbone of the KOSMOS-G MLLM is a Transformer-based causal language model, serving as a general-purpose interface to multimodal input. We train KOSMOS-G following an "align before instruct" manner; the entire training pipeline can be divided into three stages:

  1. Multimodal Language Modeling: We pre-train the MLLM from scratch on multimodal corpora, including monomodal data, cross-modal paired data, and interleaved multimodal data with language modeling loss following KOSMOS-1.

  2. Image Decoder Aligning: We use the U-Net [RFB15] of Stable Diffusion v1.5 [RBL+22] as our image decoder. We train an AlignerNet on text-only data to align the output space of KOSMOS-G with U-Net's input space through CLIP supervision. Here, language acts as the anchoring modality, ensuring that image input is also compatible with the image decoder.

  3. Instruction Tuning: We further fine-tune KOSMOS-G through a compositional generation task on curated data, with the differentiable gradient passed from the frozen U-Net.

In Stage 1, only the MLLM is trained. In Stage 2, AlignerNet is trained with MLLM frozen. During Stage 3, both AlignerNet and MLLM are jointly trained. The image decoder remains frozen throughout all stages.
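
The stage schedule above can be summarized as a freeze/unfreeze configuration. The sketch below is a hypothetical helper with assumed module names, not the paper's training code.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(stage: int, mllm: nn.Module, aligner: nn.Module, image_decoder: nn.Module) -> None:
    """Freeze/unfreeze modules according to the three-stage schedule described above."""
    set_trainable(image_decoder, False)          # U-Net / VAE stay frozen in every stage
    if stage == 1:                               # multimodal language modeling
        set_trainable(mllm, True)
        set_trainable(aligner, False)
    elif stage == 2:                             # image decoder aligning
        set_trainable(mllm, False)
        set_trainable(aligner, True)
    elif stage == 3:                             # instruction tuning
        set_trainable(mllm, True)
        set_trainable(aligner, True)
```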

2.1 Multimodal Language Modeling

Following KOSMOS-1, KOSMOS-G perceives general modalities in a unified way. To achieve this, we represent the input format as a single sequence using special tokens. Specifically, we use the <s> and </s> tokens to denote the start and end of a sequence. We also incorporate <image> and </image> tokens to indicate the start and end of any embedded image representations within the sequence.

Our methodology involves encoding both text tokens and images into vectors, which are then fed into the decoder. For text tokens, we use a lookup table to map them into embeddings. To handle the input images, we employ a vision Transformer [DBK+21] as the embedding module. Furthermore, Resampler [ADL+22] is used as an attentive pooling mechanism to reduce the number of image embeddings. After obtaining the embeddings of an input sequence, we feed them into the Transformer-based decoder. The left-to-right causal decoder processes the sequence in an auto-regressive manner. A softmax classifier on top of the Transformer is used to assign probabilities to each token in the vocabulary.
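
A rough PyTorch sketch of this embedding pipeline, with a Perceiver-style attentive pooling standing in for the Resampler; the dimensions, query count, and the simple concatenation (instead of true interleaving) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Attentive pooling: a fixed set of learned queries cross-attends to ViT patch features,
    reducing a variable number of image embeddings to `num_queries` vectors."""
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:    # (B, P, dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_feats, patch_feats)            # (B, num_queries, dim)
        return pooled

def embed_sequence(text_ids, image_patches, token_embedding: nn.Embedding,
                   vision_encoder: nn.Module, resampler: Resampler) -> torch.Tensor:
    """Map text ids through a lookup table and images through ViT + Resampler, then splice
    both into one sequence for the causal decoder (interleaving order omitted for brevity)."""
    text_emb = token_embedding(text_ids)                 # (B, L_text, dim)
    patch_feats = vision_encoder(image_patches)          # (B, P, dim), e.g. CLIP ViT-L/14 features
    image_emb = resampler(patch_feats)                   # (B, num_queries, dim)
    return torch.cat([text_emb, image_emb], dim=1)       # simplified: real input is interleaved
```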

KOSMOS-G is first trained using the next-token prediction task. The training objective is to maximize the log-likelihood of tokens in examples. It’s important to note that the training loss only takes into account discrete tokens, specifically text tokens. The MLLM component has 24 layers with 2,048 hidden dimensions, 8,192 FFN intermediate size, and 32 attention heads. For faster convergence, the image representation is obtained from a pre-trained CLIP ViT-L/14 model with 1,024 feature dimensions. The images are preprocessed into 224×224 resolution during training. We freeze the parameters of the CLIP model except for the last layer during training. The total number of parameters of the MLLM is about 1.6B.

2.2 Image Decoder Aligning

After undertaking multimodal language modeling, we have successfully aligned vision and language perception within MLLM. To make KOSMOS-G capable of image generation, we incorporate diffusion models [SWMG15] as our image decoder. Specifically, we adopt the widely accepted Stable Diffusion v1.5 [RBL+22]. It’s important to note that we only replace the CLIP text encoder [RKH+21] with multimodal KOSMOS-G, without making any modifications to the U-Net architecture or weights. This setup allows KOSMOS-G to effectively collaborate with techniques applied to the U-Net, like ControlNet [ZA23] and various community LoRA [HSW+22] variants. In this section, we will provide brief preliminaries of latent diffusion models, and then delve into the process of aligning the output space of KOSMOS-G with the image decoder after the aforementioned replacement.

Preliminaries of Latent Diffusion Models Diffusion models define a Markov chain of forward diffusion process $q$, adding Gaussian noise to the initial real data $z_0 \sim q(z)$ over $T$ steps. Here, $z$ denotes latent representations rather than pixel values. The efficient, low-dimensional latent space is approximately perceptually equivalent to the high-dimensional RGB space, while the redundant, semantically meaningless information present in the pixel domain is eliminated. Perceptual compression models (i.e., VQ-VAE) consisting of an encoder $E$ and a decoder $D$ map the real data into the latent space and back, such that $D(E(x)) \approx x$. Latent diffusion models use latent representations $z = E(x)$ instead of working directly with pixel values during the diffusion process. The final output can be decoded back to pixel space via $D(z)$. The separate, mild perceptual compression stage only eliminates imperceptible details, leading to competitive generation results at a much lower cost. The forward process $q(z_t \mid z_{t-1})$ at each time step $t$ can be expressed as follows:

\[q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t \mathbf{I}\right)\]

Diffusion models learn a U-Net [RFB15] denoted as $\epsilon_\theta$ to reverse the forward diffusion process, constructing the desired data samples from noise. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. We can reparameterize the denoising process $p(z_{t-1} \mid z_t)$ also as a Gaussian distribution. This distribution can be estimated by $\epsilon_\theta$ and takes the following form:

\[p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\left(z_{t-1};\ \frac{1}{\sqrt{\alpha_t}}\Big(z_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(z_t, t)\Big),\ \beta_t \mathbf{I}\right)\]

The learning objective of diffusion models is to approximate the mean $\mu_\theta(z_t, t)$ in the reverse diffusion process. To achieve this, we can utilize the variational lower bound (ELBO) [KW14] to minimize the negative log-likelihood of $p_\theta(z_0)$ [HJA20]. The simplified objective can be expressed as a denoising objective:

\[L = \mathbb{E}_{z,\, \epsilon \sim \mathcal{N}(0, \mathbf{I}),\, t}\left[\left\| \epsilon - \epsilon_\theta(z_t, t, \varphi) \right\|_2^2\right]\]

At sampling time, classifier-free guidance combines the conditional and unconditional predictions, $\tilde{\epsilon}_\theta(z_t, t, \varphi) = \epsilon_\theta(z_t, t) + w\,\big(\epsilon_\theta(z_t, t, \varphi) - \epsilon_\theta(z_t, t)\big)$, where $w$ is the guidance scale and $\varphi$ denotes the condition.
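
A hedged PyTorch sketch of this denoising objective on latents, together with the classifier-free guidance combination; the U-Net call signature, the condition-dropping probability, and the guidance convention are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(unet, z0, cond, alphas_cumprod, drop_cond_p: float = 0.1):
    """Simplified denoising objective on latents z0, conditioned on cond (phi).
    The condition is occasionally dropped so classifier-free guidance is possible later."""
    B = z0.size(0)
    t = torch.randint(0, alphas_cumprod.numel(), (B,), device=z0.device)
    noise = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise       # forward process q(z_t | z_0)
    if torch.rand(()) < drop_cond_p:
        cond = torch.zeros_like(cond)                           # unconditional branch
    return F.mse_loss(unet(zt, t, cond), noise)                 # || eps - eps_theta(z_t, t, phi) ||^2

def cfg_epsilon(unet, zt, t, cond, w: float):
    """Classifier-free guidance at sampling: eps_u + w * (eps_c - eps_u)."""
    eps_c = unet(zt, t, cond)
    eps_u = unet(zt, t, torch.zeros_like(cond))
    return eps_u + w * (eps_c - eps_u)
```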

Align Output Space with Diffusion Models Upon replacing the previous CLIP text encoder with KOSMOS-G, the main focus is to address the misalignment between KOSMOS-G and the image decoder. We discovered that simply fine-tuning KOSMOS-G using the gradient passed from the image decoder results in both trivial alignment and compromised image quality. Inspired by [QYX+23], we propose AlignerNet, consisting of an encoder $M$ and a decoder $N$, to learn the alignment between the KOSMOS-G source space $S$ and the CLIP text encoder target space $T$. Given a single text-only caption $C$, the KOSMOS-G source encoder and the CLIP text target encoder encode the caption into embeddings denoted as $s \in \mathbb{R}^{l_s \times d_s}$ and $t \in \mathbb{R}^{l_t \times d_t}$, respectively. Here, $l$ and $d$ indicate the length of features and the embedding dimensions.

As shown in Figure 3a, we employ the encoder $M$ to minimize the distance between the text source embedding and the target embedding, aiming for a close approximation $M(s) \approx t$ through:

\[\mathcal{L}_{\text{align}} = \left\| M(s) - t \right\|_2^2\]

To mitigate the reduction in feature discrimination, we also employ a decoder $N$ to reconstruct the source embedding, $N(M(s)) \approx s$, through:

\[\mathcal{L}_{\text{rec}} = \left\| N(M(s)) - s \right\|_2^2\]

Figure 3: Overview of alignment. (a) Align process: text serves as an anchor; image embeddings are naturally aligned throughout the process. (b) AlignerNet architecture: linear layers project the output dimension of the MLLM to d = 768; the purple elements denote the learned latent queries $Q_M$ and $Q_N$.

Different from [QYX+23], KOSMOS-G is a vision-language multimodal encoder. The language modality serves as an anchor throughout the process, aligning the entire KOSMOS-G space with the image decoder input space, thus also achieving semantic alignment for the image embeddings.

To efficiently process lengthy sequences consisting of multiple images and minimize memory usage, KOSMOS-G encodes the interleaved vision-language input sequence into variable-length embeddings. However, the use of variable length embeddings makes the MLP-based GlueNet [QYX+23] unsuitable for learning alignment. To address this, we employ a Transformer-based architecture in AlignerNet, enabling it to effectively align the source and target spaces with mismatched sequence lengths and embedding dimensions.

As shown in Figure 3b, both $M$ and $N$ share a similar architecture design, consisting of a Transformer encoder and a Transformer decoder. The Transformer encoder and decoder in both models comprise 12 layers, with an input dimension $d = 768$ and a hidden dimension of 3072. This configuration results in approximately 225M parameters in total. In the cross-attention module of the Transformer decoder, we use variable-length learned latent queries $Q_M \in \mathbb{R}^{l_t \times d}$ in $M$ and $Q_N \in \mathbb{R}^{l_s \times d}$ in $N$ to match the sequence lengths.
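
A simplified PyTorch sketch of such a Transformer-based aligner with learned latent queries; for brevity it fixes the output length (e.g., 77 for the CLIP text space) instead of using variable-length queries, and the input dimension of 2,048 is an assumption taken from the MLLM hidden size mentioned earlier.

```python
import torch
import torch.nn as nn

class AlignerBlock(nn.Module):
    """Transformer encoder + decoder where learned latent queries fix the output length,
    letting sequences with mismatched lengths and dimensions be mapped onto each other."""
    def __init__(self, d_model: int = 768, n_heads: int = 12, n_layers: int = 12,
                 ffn_dim: int = 3072, out_len: int = 77, in_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)                    # project MLLM dim -> 768
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, ffn_dim, batch_first=True), n_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, ffn_dim, batch_first=True), n_layers)
        self.latent_queries = nn.Parameter(torch.randn(out_len, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (B, L_in, in_dim)
        memory = self.encoder(self.proj(x))
        q = self.latent_queries.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.decoder(q, memory)                            # (B, out_len, d_model)
```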

2.3 Instruction Tuning

After achieving a semantic alignment between KOSMOS-G and the image decoder, our model can successfully generate images following interleaved vision-language guidance. However, the multimodal language modeling and text-only alignment stages only preserve semantic consistency between the input and output; KOSMOS-G still cannot leverage the rich features extracted from the image encoder to generate images that faithfully reproduce the contents in various contexts.

To pursue our objective of “image as a foreign language in image generation,” we curate interleaved vision-language data and use the diffusion loss in Equation 3 to further fine-tune KOSMOS-G. Specifically, we propose a compositional generation task in which we input captions containing entities, with each of them followed by their corresponding images, like “<s> A cat <image> image embedding of the cat </image> and a dog <image> image embedding of the dog </image> sleeping in the garden <image> image embedding of the garden </image> </s>”.

Our model is trained to generate images following the input instruction.

To construct the requisite data, we first caption the image, then extract the entities from the caption, and obtain the segmentation results from the image itself. A detailed introduction of the entire pipeline can be found in Section 3.1. Additionally, we leverage the data constructed by [BHE23] for InstructPix2Pix to improve KOSMOS-G’s image editing capability. This data is structured as: “<s> caption <image> embedding of the original image </image> edit instruction </s>”. We also mix some text-to-image data to preserve the language alignment already achieved.
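
A small Python sketch of how such interleaved training sequences might be assembled from a caption, its extracted entities, and their segmented images; the Segment container and file names are purely illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Segment:
    text: str                       # literal text to tokenize (may be a special token)
    image: Optional[str] = None     # path of the segmented entity image, if this segment is one

def compositional_sequence(caption: str, entities: List[Segment]) -> List[Segment]:
    """Interleave each entity mention in the caption with its segmented image, wrapped in
    <image> ... </image> markers, to form one instruction-tuning input sequence."""
    seq: List[Segment] = [Segment("<s>")]
    rest = caption
    for ent in entities:
        before, _, rest = rest.partition(ent.text)
        seq.append(Segment(before + ent.text))           # caption text up to and including the entity
        seq.append(Segment("<image>", image=ent.image))  # slot for the entity's image embedding
        seq.append(Segment("</image>"))
    seq.append(Segment(rest))                            # remaining caption text
    seq.append(Segment("</s>"))
    return seq

# e.g. "<s> A cat <image>...</image> and a dog <image>...</image> sleeping in the garden <image>...</image> </s>"
sequence = compositional_sequence(
    "A cat and a dog sleeping in the garden",
    [Segment("cat", "cat.png"), Segment("dog", "dog.png"), Segment("garden", "garden.png")],
)
```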

Figure 4: Overview of our data construction pipeline for compositional generation instruction tuning.

Our goal is to leverage MLLMs to model image distributions through direct latent space sampling. In this setup, the pre-trained frozen Stable Diffusion U-Net serves as a score metric, distilling the learned data distribution. This strategy is similar to Score Distillation Sampling [PJBM22]. From the perspective of score distillation, the KL divergence between KOSMOS-G and the score function is equivalently minimized for distilling learned probability density in the image decoder. This enables KOSMOS-G to leverage rich features from the image encoder to generate an image faithfully reproducing the contents across various contexts.
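
A hedged sketch of one such instruction tuning step: the VAE and U-Net stay frozen, and the diffusion loss gradient flows back only into the MLLM and AlignerNet. The function signatures and batch keys are assumptions; `loss_fn` could be the denoising objective sketched earlier.

```python
import torch

def instruction_tuning_step(mllm, aligner, unet, vae_encode, loss_fn, batch, optimizer):
    """One score-distillation-style step: the frozen U-Net acts as a score metric and the
    diffusion loss gradient flows back only into the MLLM and AlignerNet parameters."""
    with torch.no_grad():
        z0 = vae_encode(batch["target_image"])          # frozen VAE encoder E(x)
    cond = aligner(mllm(batch["interleaved_input"]))    # trainable condition pathway
    loss = loss_fn(unet, z0, cond)                      # U-Net weights are frozen, but the
                                                        # graph still reaches `cond`
    optimizer.zero_grad()
    loss.backward()                                     # only MLLM / AlignerNet receive gradients
    optimizer.step()
    return loss.item()
```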

3 Model Training

3.1 Multimodal Training Data

The multimodal language modeling stage in Section 2.1 uses the same setting as KOSMOS-1 [HDW+23], where the model is trained on web-scale multimodal corpora consisting of text corpora, image-caption pairs, and interleaved image-text data. For the image decoder aligning stage in Section 2.2, we only use the captions from the image-caption pairs. For the instruction tuning stage in Section 2.3, we use data constructed from the Open Images V7 dataset [KRA+20], the image-caption pairs, as well as the image editing data from InstructPix2Pix [BHE23].

Captions The image-caption pairs are sourced from multiple datasets, including English LAION-2B [SBV+22], LAION-400M [SVB+21], COYO-700M [BPK+22], and Conceptual Captions [SDGS18, CSDS21]. English LAION-2B, LAION-400M, and COYO-700M are collected from Common Crawl web data by extracting images and the corresponding alt-texts. Conceptual captions are also derived from web pages.

Constructed Data We use approximately 9M images from the Open Images V7 dataset [KRA+20] to construct our compositional generation instruction tuning data. As illustrated in Figure 4, we begin by generating captions with BLIP-2-OPT-6.7b [LLSH23]. Subsequently, we employ an LLM MPT-7B-Instruct [Tea23] to extract entities from the captions. The original image, along with the text of each entity, is then input into the text-prompted segmentation model CLIPSeg [LE22] to derive the corresponding image of each entity.
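
A hedged sketch of this construction pipeline using publicly available checkpoints that plausibly correspond to the models named above; the entity-extraction prompt and the post-processing are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import (AutoModelForCausalLM, AutoTokenizer, Blip2ForConditionalGeneration,
                          Blip2Processor, CLIPSegForImageSegmentation, CLIPSegProcessor)

cap_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-6.7b")
cap_model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-6.7b")
llm_tok = AutoTokenizer.from_pretrained("mosaicml/mpt-7b-instruct")
llm = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b-instruct", trust_remote_code=True)
seg_proc = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
seg_model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

def build_sample(image: Image.Image):
    # 1) Caption the image with BLIP-2.
    cap_inputs = cap_proc(images=image, return_tensors="pt")
    caption = cap_proc.batch_decode(cap_model.generate(**cap_inputs), skip_special_tokens=True)[0]
    # 2) Ask the instruction-tuned LLM for the entities in the caption (prompt is illustrative).
    prompt = f"Extract the entities from this caption as a comma-separated list: {caption}\nEntities:"
    ids = llm_tok(prompt, return_tensors="pt")
    out = llm.generate(**ids, max_new_tokens=32)
    decoded = llm_tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    entities = [e.strip() for e in decoded.split(",") if e.strip()]
    # 3) Text-prompted segmentation with CLIPSeg, one mask per entity.
    seg_inputs = seg_proc(text=entities, images=[image] * len(entities),
                          return_tensors="pt", padding=True)
    masks = torch.sigmoid(seg_model(**seg_inputs).logits)        # roughly (num_entities, H, W)
    return caption, entities, masks
```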

3.2 Training Setup

Our implementation is based on the TorchScale [MWH+22] library, which is designed for large-scale model training. Following KOSMOS-1 [HDW+23], we also use MAGNETO [WMH+22], a Transformer variant, as the backbone architecture of our MLLM and AlignerNet. The whole training process took around four days with 256 NVIDIA V100 GPUs, i.e., one day for image decoder aligning, and three days for instruction tuning. In the instruction tuning stage, we use a blend of constructed data, InstructPix2Pix data, and caption data in a ratio of 2:2:1. For constructed data, to enhance input robustness, we randomly drop the texts of entities with a probability of 0.5 and also maintain the background of the segmented entities with a 0.5 probability.
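
A small sketch of that sampling and augmentation scheme; the dataset containers and field names are hypothetical.

```python
import random

def sample_instruction_example(constructed, instructpix2pix, captions):
    """Draw one training example with the 2:2:1 mixing ratio; for constructed data, drop the
    entity text with p=0.5 and keep the segmented entity's background with p=0.5."""
    source = random.choices(
        ["constructed", "instructpix2pix", "caption"], weights=[2, 2, 1], k=1)[0]
    if source == "constructed":
        sample = random.choice(constructed)
        for entity in sample["entities"]:
            if random.random() < 0.5:
                entity["text"] = ""                      # drop the entity's text for robustness
            entity["keep_background"] = random.random() < 0.5
        return sample
    if source == "instructpix2pix":
        return random.choice(instructpix2pix)
    return random.choice(captions)
```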


Figure 5: Zero-shot image generation examples with multimodal prompts.

Multimodal Language Modeling We use a batch size of 1.2 million tokens which is broken down as follows: 0.5 million tokens sourced from text corpora, 0.5 million tokens derived from image-caption pairs, and 0.2 million tokens from interleaved data sets. The MLLM is trained for 300,000 steps, corresponding to about 360 billion tokens in total. We adopt the AdamW optimizer with β = (0.9, 0.98). Furthermore, we configure the weight decay at 0.01 and the dropout rate at 0.1. The learning rate escalates to 2e-4 during the initial 375 warm-up steps and decays linearly to 0 for the rest of the training steps. For optimization stability, we use the Magneto initialization. We use SentencePiece [KR18] to tokenize the text. We preprocess the data in the “full-sentence” format [LOG+19], where each input sequence is populated with complete sentences consecutively sampled from one or multiple documents.
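
The optimizer and schedule described above might look as follows in PyTorch; the helper name and the exact decay endpoint are assumptions.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(params, total_steps: int = 300_000, warmup: int = 375, peak_lr: float = 2e-4):
    """AdamW with betas (0.9, 0.98) and weight decay 0.01, linear warmup to the peak LR over
    375 steps, then linear decay to 0 for the remaining steps."""
    opt = torch.optim.AdamW(params, lr=peak_lr, betas=(0.9, 0.98), weight_decay=0.01)

    def lr_lambda(step: int) -> float:
        if step < warmup:
            return step / max(1, warmup)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup))

    return opt, LambdaLR(opt, lr_lambda)
```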

Image Decoder Aligning The AlignerNet undergoes training using a batch size of 3,584 sentences for 300,000 steps, with a maximum learning rate of 1e-3. This equates to approximately 1 billion sentences overall. The remaining configurations remain consistent with the previous stage.

Instruction Tuning The MLLM and AlignerNet are jointly trained with a batch size of 1,024 images, totaling approximately 200 million images over 200,000 steps. The learning rate peaks at 1e-3. The remaining settings are the same as in the previous stage.

4 Evaluation

4.1 Main Qualitative Results

As shown in Figure 5, KOSMOS-G delivers impressive zero-shot generation results across diverse settings, yielding meaningful and coherent outputs even for highly customized subjects. The visual samples showcase generative capabilities in re-contextualization, stylization, modification, and accessory incorporation. Notably, multi-entity VL2I is very challenging even for fine-tuning methods like DreamBooth [RLJ+22]. Owing to the novel compositional generation instruction tuning, KOSMOS-G is the first model capable of achieving this in a zero-shot setting.

Table 1: Left: Quantitative comparisons on DreamBench. ∗ denotes zero-shot methods. Right: Zero-shot FID comparisons on MS-COCO. † indicates results evaluated by us under the same settings and seed as KOSMOS-G.

4.2 Quantitative Results

We do quantitative evaluations of KOSMOS-G on DreamBench [RLJ+22] for single-entity subject-driven generation and MS-COCO [LMB+14] for text-to-image generation.

The DreamBench dataset contains 30 subjects and features 25 prompt templates, resulting in 750 unique prompts covering skills such as re-contextualization, modification, and accessorization. Following prior work, we generate 4 images for each prompt, forming 3,000 images for a comprehensive evaluation. We follow DreamBooth in adopting DINO and CLIP-I to evaluate subject fidelity and CLIP-T to evaluate text fidelity. We use a classifier-free guidance scale of 7.5 and 100 DPM-Solver [LZB+22] inference steps for sampling. As shown in Table 1, zero-shot KOSMOS-G outperforms Textual Inversion and Re-Imagen, and exhibits marginally better performance than DreamBooth and BLIP-Diffusion with only a single image input. Our results are also comparable with SuTI, without requiring expensive apprenticeship learning supervision. Since KOSMOS-G accepts only a single image as input, we select a clear image from the 4-7 provided images for each subject to avoid occlusion. We slightly modify the prompt template to ensure better alignment with the instruction tuning data. The images and prompts used can be found in Appendix A.
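
For reference, a hedged sketch of how CLIP-T and DINO subject-fidelity scores are typically computed; the specific checkpoints are assumptions and may differ from those used in the paper, and CLIP-I is the same cosine similarity computed with CLIP image features instead of DINO features.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor, ViTImageProcessor, ViTModel

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = ViTModel.from_pretrained("facebook/dino-vits16")
dino_proc = ViTImageProcessor.from_pretrained("facebook/dino-vits16")

@torch.no_grad()
def clip_t(image, prompt: str) -> float:
    """Text fidelity: cosine similarity between CLIP image and text embeddings."""
    inputs = clip_proc(text=[prompt], images=[image], return_tensors="pt", padding=True)
    img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"])
    return F.cosine_similarity(img, txt).item()

@torch.no_grad()
def dino_score(generated, reference) -> float:
    """Subject fidelity: cosine similarity of DINO [CLS] features of generated vs. real subject."""
    feats = [dino(**dino_proc(images=[im], return_tensors="pt")).last_hidden_state[:, 0]
             for im in (generated, reference)]
    return F.cosine_similarity(feats[0], feats[1]).item()
```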

For text-to-image generation, we generate images using 30,000 randomly sampled captions from the MS-COCO (2014) validation set. We use a classifier-free guidance scale of 3.0 and 250 DDIM [SME21] inference steps for sampling. As shown in Table 1, KOSMOS-G surpasses other CLIP-aligned VL2I models, delivering the best alignment results.

4.3 Ablation Studies

We conduct ablation studies to assess the importance of image decoder aligning and instruction tuning. Table 2 demonstrates that direct end-to-end fine-tuning fails to generate meaningful images. Incorporating AlignerNet and CLIP supervision, however, yields results very close to the original SD v1.5. We also compare the generation results of KOSMOS-G before instruction tuning and of standard SD v1.5 against our final model. As illustrated in Figure 6, without instruction tuning, KOSMOS-G can only generate content semantically aligned with the vision-language input. The SD baseline also remains at the semantic level and fails to faithfully reproduce the entities in the generated images.

Table 2: Ablation study results for image decoder aligning on MS-COCO.

4.4 Applications

As highlighted in Section 2.3, KOSMOS-G can seamlessly replace CLIP in any image generation system. This remarkable property unlocks a myriad of applications that were not possible before. We demonstrate its integration with ControlNet [ZA23] and LoRA variants [HSW+22] in Figure 7.

Figure 6: Comparisons with cases presented in the second row of Figure 1.

(a) KOSMOS-G with canny control using ControlNet.

(b) KOSMOS-G with LoRA variant.

Figure 7: Various applications of KOSMOS-G in conjunction with U-Net techniques. In Figure 7b, the left image is generated using the standard U-Net, while the right one is produced with a LoRA-tuned U-Net.

KOSMOS-G works perfectly with these techniques. Building on the CLIP space, we believe our model will push forward the transition from text-conditioned generation toward vision-language generation, paving the way for numerous novel applications.
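
A hedged sketch of this plug-and-play use with the public diffusers API: `kosmos_g_encode` stands in for a hypothetical Kosmos-G encoder that returns condition embeddings shaped like CLIP text-encoder outputs, and the LoRA path is a placeholder.

```python
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

def generate_with_controls(kosmos_g_encode, interleaved_inputs, canny_edge_map: Image.Image):
    """kosmos_g_encode: hypothetical callable returning a (1, 77, 768) embedding tensor that
    replaces the CLIP text-encoder output; the rest uses the public diffusers API."""
    controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet)
    pipe.load_lora_weights("path/to/community-lora")       # placeholder stylized U-Net variant
    prompt_embeds = kosmos_g_encode(interleaved_inputs)    # condition from interleaved VL input
    return pipe(prompt_embeds=prompt_embeds, image=canny_edge_map,
                guidance_scale=7.5, num_inference_steps=50).images[0]
```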

5 Conclusion

We propose KOSMOS-G, a model capable of high-fidelity zero-shot image generation from generalized vision-language input that spans multiple images. Our approach hinges on a unique “align before instruct” pre-training strategy. KOSMOS-G demonstrates competitive single-entity subject-driven image generation and text-to-image capability, and it also stands as the first model to extend zero-shot subject-driven image generation to multi-entity scenarios. Furthermore, KOSMOS-G allows seamless replacement of CLIP, unlocking various new applications in conjunction with other U-Net techniques such as ControlNet and LoRA. In general, we present KOSMOS-G as a preliminary effort aimed at achieving the objective of “image as a foreign language in image generation.”
