
Web Images with LLaMA-3

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-06-14

What If We Recaption Billions of Web Images with LLaMA-3?

  • url: https://arxiv.org/abs/2406.08478
  • pdf: https://arxiv.org/pdf/2406.08478
  • html: https://arxiv.org/html/2406.08478
  • abstract: Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users’ text instructions, especially in following complex queries. Our project page is this https URL.

Model: DataComp-1B



TL;DR


  1. Image-text recaptioning for data quality: LLaMA-3 and LLaVA models are used to regenerate the textual descriptions of images in the DataComp-1B dataset.
  2. Vision-language model training: the recaptioned data is used to improve the performance of CLIP models and text-to-image generative models.
  3. Stronger language understanding and semantic matching: the improved captions enable accurate matching between images and text and raise the models' language understanding.

1 Introduction

The recent explosive growth of data has driven much of deep learning's success, and collecting large-scale data through web crawling has become standard practice. This approach, however, degrades data quality; image-text data in particular suffer from misalignment between the two modalities. The paper stresses that post-processing through human-in-the-loop systems or automated pipelines is needed to address this, and proposes regenerating the textual descriptions of images in the DataComp-1B dataset using LLaMA-3.


2 Related Work

Vision-language foundation models connect images and text; models such as CLIP gain strong zero-shot capabilities by training on hundreds of millions, or even billions, of image-text pairs. Two approaches are commonly used to improve the quality of image-text data: data filtering and data recaptioning. This paper focuses on data recaptioning and presents a method for generating new captions with LLaMA-3 and LLaVA.


3 Recaptioning Pipeline

3.1 Model Details

The captioner is built on LLaVA-1.5, with LLaMA-3-8B as the language decoder for stronger performance. CLIP ViT-L/14 serves as the vision encoder, and two MLP layers are trained to project visual features into the language embedding space.

3.2 Recaptioning DataComp-1B

In this stage, the LLaVA-1.5-LLaMA3-8B model generates new captions for the images in DataComp-1B auto-regressively. The process improves both the quality of the text and its alignment with the images.


4 Analyzing Recap-DataComp-1B

This section gives a quantitative analysis of the generated captions. Word distributions and average caption lengths are compared, confirming that the new dataset surpasses the original in vocabulary diversity and sequence length. CLIP and GPT-4V are then used to evaluate the semantic quality of the captions. The recaptioned content achieves higher semantic-matching scores than the originals, demonstrating closer agreement between images and text.


5 CLIP

Training CLIP on the Recap-DataComp-1B dataset yields improved zero-shot cross-modal retrieval. Experiments with varying proportions of recaptioned captions determine the optimal mixing ratio, and as the share of recaptioned data grows, enlarging the text encoder further improves model performance.


6 Text-to-Image Generation

Training text-to-image generative models on the recaptioned captions improves both generation quality and prompt-following ability. Diffusion Transformers are trained with various mixing ratios and evaluated with FID and CLIP scores. Training that includes the recaptioned data improves visual quality and the alignment between generated images and the text condition.


1 Introduction

The exponential growth in data availability is one of the most important factors driving the monumental successes of deep learning over the past decade [13, 32, 6, 57, 19, 17]. Typically, this data is sourced through web crawling with simple filtering mechanisms in place. While such an approach has facilitated large-scale data collection, exemplified by collections like LAION-400M [57] and LAION-5B [57] with billions of image-text records, it has inadvertently compromised data quality. As illustrated in Figure 1, these internet-crawled image-text pairs frequently exhibit misalignments between images and their corresponding textual content, and often, the textual descriptions are brief and lack detailed information.

To mitigate the noise present in web-crawled data, enhancements through post-processing, implemented via human-in-the-loop systems [61, 70] or automated pipelines [57, 28, 27], are crucial, as they help to train advanced vision-language foundation models. Notably, both the closed-source DALL-E 3 [41] and SORA [42] incorporate advanced captioning techniques to re-label their training datasets, a crucial step highlighted in their technical reports. Despite various efforts to open-source and replicate these methodologies [9, 28, 27, 35, 69, 16, 51], the community continues to face significant challenges in accessing high-quality, well-aligned image-text data at scale (e.g., at the billion level) for training advanced vision-language foundation models.

This paper endeavors to contribute to this community initiative, inspired specifically by the release of LLaMA-3 [62], a model demonstrating GPT-4-level capabilities across a variety of linguistic tasks. Additionally, recent studies have shown that leveraging LLaMA-3 can significantly enhance model performance on vision-language tasks [34, 65], comparable to those achieved by GPT-4V [1]. In response, we employ LLaMA-3 to develop our advanced captioner model. Our approach is straightforward: we first train a LLaMA-3-powered Llava model to act as an image captioner, which is then utilized to recaption the entire DataComp-1B dataset. As depicted in Figure 1, the resulting dataset, dubbed Recap-DataComp-1B, features enhanced textual descriptions and improved alignment with corresponding images, clearly surpassing its web-crawled counterparts. These quality enhancements are further quantitatively verified in Section 4.

Comprehensive evaluations highlight the significant improvements that Recap-DataComp-1B contributes to the training of advanced vision-language foundation models. Notably, this dataset enables CLIP models to achieve significant enhancements in their zero-shot cross-modal retrieval capabilities. It also enhances the alignment between generated images and text instructions in text-to-image generative models pre-trained on our dataset. We hope that the release of Recap-DataComp-1B will catalyze further developments in advanced vision-language foundation models, particularly encouraging the development within the open-source community.

Figure 2: The illustration of our recaptioning pipeline on DataComp-1B. We use LLaMA-3-powered LLaVA to recaption images, which enables us to train stronger CLIP models and Text-to-Image Diffusion models.

2 Related Work

Vision-Language Foundation Models. CLIP [47] is one of the pioneering foundation models to connect image and text. By training on millions, and even billions, of image-text pairs [6, 14, 17, 19, 56–59], CLIP showcases exceptionally strong zero-shot capacities and lays the cornerstone for building more advanced vision-language foundation models [3, 28, 27, 63, 35, 34, 10, 4, 65]. Apart from discriminative vision-language understanding, text-to-image generation models [15, 40, 41, 45, 48–50, 53, 68] have transformed the field of AI-generated content, facilitating the creation of high-quality images from natural language descriptions.

Enhancing Image-Text Data. Web-crawled image-text data [57, 19, 17] commonly face the problems of image-text misalignment and low-quality textual descriptions. Typically, there are two popular ways to improve the quality of these image-text pairs:

  • 1) data filtering removes misaligned image-text pairs using various methods such as cleaning strategies [56, 19, 64], pretrained models [28, 57, 19], and human-assisted systems [61, 70, 77];
  • 2) data recaptioning improves the textual quality of image-text pairs by generating new captions, which is the focus of this paper.

To recaption data, LaCLIP [16] utilizes large language models (LLMs) like ChatGPT to rewrite the original captions; Nguyen et al. [39] employ BLIP2 [27] to recaption images. More recently, advanced large multimodal models have been applied to further enhance the quality of image captioning. For example, ShareGPT4V [9] employs GPT-4V to create highly descriptive captions from carefully crafted prompts and corresponding image inputs; the resulting dataset has significantly benefited the training of various models [7, 76, 12, 31, 18]. However, scaling such prompting with GPT-4V to billions of records is less practical, as it will drastically increase the monetary cost (of intensively calling OpenAI APIs) by more than 10,000×.

Our paper mostly follows the approach presented in [8, 38, 76, 12], where advanced open-source multimodal models like LLaVA [35] are employed for recaptioning purposes. However, our approach is distinguished by two major aspects: 1) we strongly enhance the LLM module in LLaVA, i.e., building with LLaMA-3; and 2) our recaptioning efforts are executed on a billion-scale dataset.

3 Recaptioning Pipeline

Our recaptioning pipeline is centered around the advanced LLM LLaMA-3 [62], which achieves exceptionally strong performance in language understanding, reasoning, code generation, math problems, etc. [11, 60]. Specifically, we utilize the LLaVA framework [35] to fully harness its capabilities for visual understanding. We describe the detailed training procedures below.

3.1 Model details

Model Configuration. We follow the setup of LLaVA-1.5 [33] to build our captioner model, except that we use LLaMA-3-8B as the language decoder because of its superior performance. The visual branch of CLIP ViT-L/14 [46] is used as the vision encoder. Two trainable MLP layers are employed on top of the vision encoder to project visual features into the language embedding space.
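
To make the projection step concrete, below is a minimal PyTorch sketch of the two trainable MLP layers described above; the feature dimensions (1024 for CLIP ViT-L/14, 4096 for LLaMA-3-8B), the GELU activation, and the 576-patch example are common defaults assumed for illustration, not values taken from the paper's code.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two trainable MLP layers that map frozen CLIP ViT-L/14 patch features
    into the language embedding space of LLaMA-3-8B (dimensions assumed)."""

    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(patch_features)

# The projected visual tokens are concatenated with the text embeddings
# before being fed to the language decoder.
projector = VisionProjector()
visual_tokens = projector(torch.randn(2, 576, 1024))  # -> (2, 576, 4096)
```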

2-Stage Training. We also follow LLaVA-1.5 [33] for model training. Essentially, we conduct instruction-tuning on the pre-trained LLM with its original auto-regressive training objective. In the first stage, only the projection MLP is trained; in the second stage, we fine-tune both the projection MLP and the language decoder. Note that the vision encoder remains frozen all the time. Following the protocols in LLaVA [33], 558k image-text pairs filtered from LAION [56], CC [6], and SBU [43] are used as training data in the first stage; then 665k instruction-following examples from LLaVA-1.5 [33], containing image-grounded conversation, image descriptions, and image-based complex reasoning tasks, are used for the second stage of training. To help our model generate higher-quality captions, we use the image-text pairs from the HQ-Edit dataset [21] for further tuning.
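
The staged freezing scheme can be summarized with a short sketch; `captioner` is a hypothetical wrapper exposing the vision encoder, the projector, and the language decoder as submodules.

```python
def set_trainable(module, flag: bool) -> None:
    for param in module.parameters():
        param.requires_grad = flag

def configure_stage(captioner, stage: int) -> None:
    """Stage 1: train only the projection MLP.
    Stage 2: fine-tune the projection MLP and the language decoder.
    The vision encoder stays frozen in both stages."""
    set_trainable(captioner.vision_encoder, False)
    set_trainable(captioner.projector, True)
    set_trainable(captioner.language_decoder, stage == 2)
```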

Evaluations. To probe the visual understanding and reasoning ability of our LLaVA-1.5-LLaMA3-8B model, we opt for two comprehensive multi-modal evaluation benchmarks, MMMU [72] and MM-Vet [71]. These benchmarks assess a broad range of capabilities such as recognition, spatial awareness, OCR, knowledge, and language generation. As reported in Table 1, on both benchmarks, our LLaVA-1.5-LLaMA3-8B model surpasses the vanilla LLaVA-1.5-7B model by a significant margin. These results also substantially beat the considerably larger LLaVA-1.5-13B model, clearly demonstrating the superior visual understanding and reasoning ability of our model.

3.2 Recaptioning DataComp-1B

With this advanced LLaVA model, we next use it to generate captions in a scalable and detailed manner, given the visual input, and the following text prompt:

Please generate a detailed caption of this image. Please be as descriptive as possible.

As for the dataset, we opt for DataComp-1B [19], a widely accessible, large-scale vision-language dataset comprising ∼1.3 billion web-crawled image-text pairs. To ensure its quality, DataComp-1B is already a curated subset from a much larger collection of 12.8 billion image-text pairs and has been subjected to rigorous preprocessing that includes safety checks, deduplication, CLIP score filtering, and image-based filtering. Despite these efforts, the original captions in DataComp-1B are still of relatively low quality.

We apply our well-trained LLaVA-1.5-LLaMA3-8B model to recaption the entire DataComp-1B dataset. Specifically, captions are generated auto-regressively via greedy decoding, with the maximum output token length set to 128. We term this newly recaptioned dataset Recap-DataComp-1B.
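
A hedged sketch of what this recaptioning loop could look like with the Hugging Face transformers LLaVA interface is shown below; the checkpoint path is a placeholder for the paper's LLaVA-1.5-LLaMA3-8B weights, and the exact prompt template depends on the checkpoint, while greedy decoding with a 128-token cap follows the setting described above.

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "path/to/llava-1.5-llama3-8b"  # placeholder checkpoint path
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID).to("cuda").eval()

# LLaVA-1.5-style prompt; the paper's caption instruction is quoted above.
PROMPT = ("USER: <image>\nPlease generate a detailed caption of this image. "
          "Please be as descriptive as possible. ASSISTANT:")

@torch.no_grad()
def recaption(images):
    """Greedy decoding with at most 128 new tokens, as in the paper."""
    inputs = processor(text=[PROMPT] * len(images), images=images,
                       return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)
    return processor.batch_decode(output_ids, skip_special_tokens=True)
```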

4 Analyzing Recap-DataComp-1B

This section collects and presents a quantitative analysis of our generated captions on DataComp-1B. We primarily focus on two aspects:

  • 1) the inherent features of the captions, including word distributions and average lengths; and
  • 2) the semantic quality of the captions, evaluated in terms of the matching similarity between images and captions and the inherent textual quality of the captions.

4.1 Word & Length Distribution

We begin our analysis by comparing the word frequency distributions between our recaptioned content and that from the original DataComp-1B, as illustrated in Figure 1, analyzing a randomly sampled subset of approximately 0.35 billion instances. Our findings reveal that the recaptioned content displays a considerably richer vocabulary, capturing 82.86% of the tokens in the word collections from both ours and the original caption data. Additionally, there is a noticeable variety in the usage of nouns and adjectives in our captions (e.g., “white” and “background”). We argue that this increased lexical diversity is a direct consequence of the extended length of our data. We thus present the distribution of caption lengths in Figure 3 to highlight this difference. On average, our recaptioned data demonstrates a longer sequence length of 49.43, whereas the original DataComp captions have a much shorter length of 10.22. These observations validate that our Recap-DataComp-1B surpasses the original DataComp-1B version in terms of both caption length and diversity.
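
The reported statistics (average caption length, vocabulary size) can be approximated for any caption set with a few lines; this is a generic whitespace-tokenized sketch, not the analysis script used in the paper.

```python
from collections import Counter

def caption_stats(captions):
    """Average whitespace-token length and vocabulary size of a caption list."""
    vocab = Counter()
    total_tokens = 0
    for caption in captions:
        tokens = caption.lower().split()
        vocab.update(tokens)
        total_tokens += len(tokens)
    return {
        "avg_length": total_tokens / max(len(captions), 1),
        "vocab_size": len(vocab),
        "top_words": vocab.most_common(10),
    }

# e.g. compare caption_stats(original_captions) with caption_stats(recaptions)
```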

Figure 3: Average length distributions of both the original captions and our recaptioned data in DataComp-1B.

4.2 GPT-4V & CLIP Evaluations

Next, we evaluate the semantic quality of recaptioned content using two models: 1) CLIP [47], which measures the semantic similarity between captions and images, and 2) GPT-4V [2], which assesses the fluency and alignment of captions with the given images.

For the CLIP evaluation, we analyze a subset of 180,000 image-text pairs. Interestingly, we note that, when using the standard CLIP-large model with ∼428M parameters for this measurement, our recaptioned content performs just comparably to the original captions (49.57 vs. 50.43). We attribute this result primarily to the limitations of the standard CLIP model, which is trained on ‘short’ captions and may inadequately capture the nuances in semantic similarity for longer captions. To probe deeper into semantic alignment between long captions and images, we utilize the LongCLIP-Large model [76], which is specifically fine-tuned to handle longer captions. With this setup, the LongCLIP score of our newly generated captions impressively attains 89.91, nearly 9× higher than the LongCLIP score of the original DataComp captions (i.e., only 10.09).
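
A sketch of how such an image-caption matching score can be computed with the open_clip library follows; the ViT-L-14/openai checkpoint is illustrative, and evaluating with LongCLIP would require loading its separately released weights and tokenizer instead.

```python
import torch
import open_clip
from PIL import Image

# Illustrative checkpoint; LongCLIP-Large would need its own weights/tokenizer.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

@torch.no_grad()
def clip_match_score(image_path: str, caption: str) -> float:
    """Cosine similarity between the image and caption embeddings."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (image_feat @ text_feat.T).item()
```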

In addition, to evaluate both the textual quality and the alignment of the captions with their corresponding images, we randomly select 10,000 instances for GPT-4V [2] evaluation, employing the prompting strategy outlined below (CAPTION is the textual input), as per [44, 26].

We can observe that our recaptioned content achieves markedly superior ratings, registering an average rating increase of 0.43 (from 3.71 to 4.14). Together with the findings from Section 4.1, this confirms the superior quality of our newly generated captions, in terms of length, vocabulary diversity, semantics, and image-text alignment.

5 CLIP

CLIP [47] stands as a widely utilized vision-language model, where an image encoder and a text encoder are jointly trained to predict correct matches across entire batches of image-text pairs. In this section, we delve into the advantages of training CLIP models with our Recap-DataComp-1B dataset. We anticipate that CLIP models trained on this dataset will exhibit superior zero-shot cross-modal retrieval capabilities and enhanced text understanding, especially with long and complex textual inputs, given the improved quality of our recaptions.

5.1 Experiment settings

Training. For reference, we term the CLIP model trained on our Recap-DataComp-1B dataset as Recap-CLIP. Our training pipeline primarily follows CLIPA [30, 29], which incorporates a two-stage training scheme, i.e., a pre-training process with a small image size followed by a fine-tuning stage at a larger image resolution. We set the text token length to 128 to accommodate the learning of long captions presented in Recap-DataComp-1B. We conduct experiments using three model scales: S/16, B/16, and L/16, with details listed in Table 2. The AdamW [37] optimizer is used for training. In the pre-training phase, the model is trained with 2.56 billion samples at a reduced image size of 112, including a warm-up phase involving 51.2 million samples. The batch size and base learning rate are set to 32,768 and 8e-6, respectively. For the subsequent fine-tuning phase, we increase the image size to 224 and train the model on 128 million samples with a 25.6 million sample warm-up. Here, we adjust the batch size to 16,384 and the learning rate to 4e-7.
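
For readability, the two-phase schedule above can be collected into a single configuration sketch; the numeric values are transcribed from this paragraph, while the field names and structure are purely illustrative.

```python
# Hyperparameters transcribed from the text; the structure is illustrative only.
RECAP_CLIP_TRAINING = {
    "text_token_length": 128,
    "optimizer": "AdamW",
    "pretrain": {
        "image_size": 112,
        "seen_samples": 2_560_000_000,   # 2.56B samples
        "warmup_samples": 51_200_000,    # 51.2M samples
        "batch_size": 32_768,
        "base_lr": 8e-6,
    },
    "finetune": {
        "image_size": 224,
        "seen_samples": 128_000_000,     # 128M samples
        "warmup_samples": 25_600_000,    # 25.6M samples
        "batch_size": 16_384,
        "base_lr": 4e-7,
    },
}
```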

Evaluation. The efficacy of Recap-CLIP is gauged via several metrics. We evaluate zero-shot image classification on the ImageNet-1K dataset [52] and assess zero-shot cross-modal retrieval performance using the validation set of MSCOCO 2014 [32] and the test set of Flickr30K [66]1, following the established practices [47, 30, 74, 75].
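
Zero-shot retrieval recall can be computed directly from normalized image and text embeddings, as in the minimal sketch below; it assumes a simplified one-caption-per-image setup rather than the exact MSCOCO/Flickr30K evaluation protocol.

```python
import torch

@torch.no_grad()
def recall_at_k(image_feats: torch.Tensor, text_feats: torch.Tensor, k: int = 1) -> float:
    """Text-to-image retrieval recall@k. Assumes row i of each matrix is a
    matched pair and that features are already L2-normalized."""
    sims = text_feats @ image_feats.T                 # (N, N) similarity matrix
    retrieved = sims.topk(k, dim=-1).indices          # top-k image indices per caption
    targets = torch.arange(sims.size(0)).unsqueeze(-1)
    return (retrieved == targets).any(dim=-1).float().mean().item()
```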

We present our results from three aspects. First, we explore the impacts of differing mix ratios between original captions and our enhanced recaptions on CLIP performance. Next, we analyze the effects of enlarging the size of the CLIP text encoder. Lastly, we investigate the text understanding capability of our Recap-CLIP, via testing on VG-Attribute [73], which evaluates attribute understanding ability, and Urban1K [76], which tests the model’s ability to handle long text.

5.2 Training with Mixed Captions

As pointed out by DALL-E 3 [41], blending both the brief genuine captions and the long informative generated captions can effectively prevent the model from unwanted overfitting to recaption data. Therefore, we hereby first study the effect of varying mix ratios between the original captions and our recaptions on the training of the Recap-CLIP B/16 model, as detailed in Table 2. Specifically, for each sample in a training batch, we randomly sample the original caption with probability $0 \leq p \leq 1$ and our recaption with probability $1 - p$; we refer to $p$ as the mix ratio:

\[\text{Caption} = \begin{cases} \text{Original} & \text{with probability } p \\ \text{Recaption} & \text{with probability } 1 - p \end{cases}\]

This strategy ensures that each batch contains a mixture of our recaption and the original captions controlled by probability p. The randomness allows each sample to encounter different captions across training epochs, potentially enhancing the model’s generalization.
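
In code, this mixing rule amounts to an independent Bernoulli draw per sample at load time, sketched below with hypothetical record fields.

```python
import random
from torch.utils.data import Dataset

class MixedCaptionDataset(Dataset):
    """Returns the original caption with probability p and the recaption
    with probability 1 - p, drawn independently for every sample."""

    def __init__(self, records, p: float = 0.8):
        # each record is assumed to hold 'image', 'original', and 'recaption'
        self.records = records
        self.p = p

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        record = self.records[idx]
        caption = record["original"] if random.random() < self.p else record["recaption"]
        return record["image"], caption
```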

1 We employ the widely used Karpathy split [24] of MSCOCO and Flickr30K.

Table 4: Train with larger text encoder. We set p = 0.8 for recaption-based models. We report zero-shot top-1 accuracy on ImageNet-1K and top-1 recall on COCO and Flickr30K.

Main results. Our findings are summarized in Table 3. We observe that reducing the mix ratio p (i.e., increasing the proportion of our recaption data) initially leads to an improvement followed by a decline in cross-modal retrieval performance. This initial improvement suggests that high-quality recaptioned data effectively enhances contrastive learning. However, the subsequent decrease indicates that the original captions from DataComp-1B provide necessary training regularization, preventing the model from overly adapting to the specific qualities of the recaption data. Interestingly, we also observe that the performance of CLIP is relatively insensitive to certain variations in the mix ratio p, as evidenced by the consistent enhancement over the baseline (i.e., p=1.0) across all four different cross-modal retrieval metrics when varying p from 0.2 (80% recaption data) to 0.9 (10% recaption data). For instance, setting p to 0.9 or 0.2 yields a similar performance enhancement of ∼3.5%, with the peak performance occurring at p=0.5, which delivers a substantial ∼5% boost.

Meanwhile, we note that incorporating our recaptions negatively affects the zero-shot classification task, exemplified by the consistent performance degradation across varying p values from 0 to 0.9. The phenomenon is similarly observed in the recent work [76], which notes that directly fine-tuning on long text can significantly hurt CLIP performance and therefore proposes several techniques for enhancing learning with long texts. In this study, given that our primary focus is on assessing the quality of Recap-DataComp-1B, we choose the ratio p = 0.8 to strike a promising balance between the classification performance (i.e., only a marginal 0.5% drop) and the cross-modal retrieval performance (i.e., a significant 3.4% boost on average) for later ablations.

5.3 Training with Larger Text Encoder

We hereby investigate how the size of the text encoder affects models trained on a mixture of the original captions and our recaptions (with p = 0.8). Specifically, we keep the architectural configuration of the vision branch as in Table 2 and only switch the text encoder. For instance, in the case of the S/16 model, we change from a smaller text encoder with 33M parameters to a larger, base-sized one with 53M parameters.

Main results. Our results, as shown in Table 4, highlight that enlarging the text encoder can further enhance performance across all model scales. The average improvement for adopting a larger text encoder in retrieval tasks is 1.4%, 1.0%, and 1.5% for small, base, and large models, respectively, suggesting that larger text encoders can help the CLIP model learn better from semantically rich captions.

Table 5: Larger text encoder with different mixed ratios. We choose Recap-CLIP-B/16 with large text encoder for this ablation.

Moreover, we re-assess the balanced ratio of recaption data using a larger text encoder. Specifically, we gradually increase the ratio of recaption data from 20% to 50%, utilizing the Recap-CLIP-B/16 model with the large text encoder. The results are presented in Table 5. Compared to the prior results where an optimal ratio is achieved at p = 0.8, using a larger text encoder can further push this optimal ratio to p = 0.6. In other words, this result concludes that, compared to the vanilla version, a stronger cross-modal retrieval performance can be achieved if 1) more recaptions are used and 2) a larger text encoder is used.

5.4 More evaluations on text understanding

Recent works demonstrate that CLIP models suffer from poor long context understanding and delicate attribute understanding [73, 76]. Given the long, enriched, and better-aligned captions, we expect Recap-CLIP to exhibit better text understanding capability. Thus, we evaluate our Recap-CLIP model on two benchmarks: (1) Urban1K [76], a long-caption image-text retrieval benchmark that contains 1k urban images and corresponding GPT-4V captions; (2) VG-Attribution [73], a modified version of Visual Genome [25] to test model abilities to attribute properties to objects. The results are shown in Tab. 6.

We observe consistent significant improvement if the model is trained on our Recap-Datacomp-1B dataset. For both text-to-image and image-to-text retrieval on Urban-1K dataset, our Recap-CLIP models surpass the vanilla baseline by at least 19% and sometimes up to an astonishingly high 36%. On the VG-attribution dataset, it is worth noting that our Recap-CLIP brings a performance boost very close to that of the NegCLIP fine-tuning [73] (e.g. ∼9% vs. 10%), a lightweight downstream fine-tuning process designed to boost CLIP ability to understand attribute and order. Nonetheless, it is noteworthy that our Recap-CLIP is naturally equipped with better text understanding ability, without any specific targeted fine-tuning, indicating the importance of better captions in web-scale data.

6 Text-to-Image Generation

It has been known to the research community that training with generated (high-quality) pseudo-captions improves text-to-image generative models in terms of generation quality and prompt following ability [8, 7, 5], primarily due to the low information and high noise density present in the original web-crawled captions. Therefore, we evaluate the quality of our generated captions by training Text-to-Image (T2I) generative models on Recap-DataComp-1B for further justification. We expect the enriched information in the generated descriptions to better align with the visual content in images, and thus improve the performance of the T2I models.

Table 7: Text-to-Image evaluation on COCO-30K results of DiT-BASE/4, trained with different mix ratios on Recap-DataComp-1B. Note for GPT-4V Score, we use a subset of 3K for the evaluation.

Training. We adopt Diffusion Transformers (DiT) [45] as our T2I model, where the text condition is first extracted with a CLIP text encoder [47], and then injected into each DiT block with the cross-attention design. Specifically, we follow the image preprocessing pipeline in DiT [45], where the images are preprocessed to have a square resolution of 256. The model is trained on visual latents extracted using a pretrained auto-encoder with a downsampling ratio of 8 [50]. Similar to the setup in previous experiments, the training text consists of a mixture of raw captions from Datacomp-1B, with a specified proportion p, and the rest of the captions replaced by refined captions from Recap-Datacomp-1B. Moreover, the training batch size is 2048, and the AdamW optimizer [37] is used with a constant 1e-4 learning rate, without any warm-up schedule or weight decay. We name the resulting model Recap-DiT.
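
A highly simplified sketch of how a text condition could be injected into a transformer block via cross-attention is given below; it omits timestep conditioning (adaLN) and other DiT specifics, so it illustrates only the conditioning pathway rather than the actual DiT architecture, and the dimensions are assumed.

```python
import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    """Latent tokens attend to CLIP text features via cross-attention
    (illustrative only; real DiT blocks also use timestep conditioning)."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, latents: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        h = self.norm1(latents)
        latents = latents + self.self_attn(h, h, h)[0]
        h = self.norm2(latents)
        latents = latents + self.cross_attn(h, text_feats, text_feats)[0]
        return latents + self.mlp(self.norm3(latents))
```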

Evaluation. For sampling, we set the classifier-free guidance scale to 10 and use 250 DDPM steps to generate 30k images with captions from MSCOCO and our improved generated captions for zero-shot generation evaluation. We calculate Fréchet Inception Distance (FID) [20] with the reference images from MSCOCO [32] and CLIP score with both the OpenAI ViT-B/32 model [47] and our own Recap-CLIP ViT-L/16 model, following the established pipeline in prior T2I works [5, 67, 54, 23, 36, 78, 55]. Additionally, following the GPT-4V metric introduced in Section 4.2, we randomly select a subset of 3,000 of our generated images for GPT-4V evaluation.
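
FID against reference images can be computed with torchmetrics as sketched below; this is a generic illustration of the metric, not the evaluation harness used in the paper, and images are expected as uint8 tensors of shape (N, 3, H, W).

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # Inception pool features

def accumulate(real_images: torch.Tensor, generated_images: torch.Tensor) -> None:
    """Stream batches of uint8 images (N, 3, H, W) into the FID statistics."""
    fid.update(real_images, real=True)        # e.g. MSCOCO reference images
    fid.update(generated_images, real=False)  # samples from the T2I model

# After all 30k generated images and the references have been accumulated:
# score = fid.compute()
```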

Main results. We report our observations in Tab. 7. Interestingly, when using raw COCO captions to generate 30,000 images for evaluation, the model trained with data integrated with our Recap-Datacomp (for $p < 1$) demonstrates a better CLIP score, indicating improved vision-language alignment. However, there is no significant improvement observed in terms of FID. Our hypothesis is that the model adapts to the more informative and descriptive prompts, and could unleash its full potential only when similar informative testing prompts are provided.

Therefore, in another setting, we evaluate images generated using our LLaVA-1.5-LLaMA3-8B recaptioned version of the raw COCO captions. Here, we observe consistent and significant improvements in both FID and CLIP scores, particularly when more than half of the recaptioned data are integrated into the training dataset. Notably, models trained on Recap-Datacomp-1B ($p = 0$) surpass those trained on the vanilla Datacomp-1B ($p = 1$) by a large margin, with improvements observed in FID (-8.4), CLIP score (+3.1), Recap-CLIP score (+8.4), and GPT-4V score (+1.1). These observations justify that Recap-Datacomp-1B better reveals the potential of text-to-image models in generating images with high visual quality and improved alignment with textual conditions.

Larger models. We further train a larger model, DiT-L/2, for 1 epoch with a mixed ratio of $p = 0.0$, while keeping other training parameters unchanged. The model achieves an FID of 25.14 and a CLIP Score of 34.82. In Figure 4, we visually compare the generated results of DiT-L/2 and DiT-B/4 at $p = 0.0$. It is evident that although the quantitative scores may not show substantial improvement, as we scale up the model, there is a noticeable enhancement in the alignment between the generated images and the corresponding text, i.e., this improved alignment results in higher-quality images that are able to capture and express more intricate details. These results confirm that DiT models trained on our recaption DataComp-1B exhibit robust scalability for text-to-image generative tasks.

Figure 4: Visual comparison of generated results from DiT-L/2 and DiT-B/4 at $p = 0.0$; DiT-L/2 has better text comprehension and image generation than DiT-B/4. We mark entities in the instruction.

7 Conclusion

This paper introduces Recap-DataComp-1B, a large-scale image dataset paired with detailed textual descriptions, generated using the LLaMA-3-powered Llava model. Our comprehensive analysis reveals that, compared to the original, web-crawled textual data, these generated descriptions align more accurately with their corresponding images and are more detailed. Utilizing Recap-DataComp-1B for training resulted in consistent enhancements across various models, notably CLIP, particularly in image-to-text and text-to-image retrieval tasks, and in text-to-image Diffusion models, specifically in their ability to follow user-provided text instructions more closely. By providing this high-quality, publicly available, large-scale image-text dataset, we hope to inspire ongoing research and development that will push the boundaries of vision-language foundation models, particularly in the open-source community.
