
LlamaFusion | Adapting Pretrained Language Models for Multimodal Generation

MinWoo (Daniel) Park | Tech Blog

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-12-20

LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation

  • url: https://arxiv.org/abs/2412.15188
  • pdf: https://arxiv.org/pdf/2412.15188
  • abstract: We present LlamaFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LlamaFusion leverages existing Llama-3 weights for processing text autoregressively while introducing additional, parallel transformer modules for processing images with diffusion. During training, the data from each modality is routed to its dedicated modules: modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while the shared self-attention layers allow interactions across text and image features. By freezing the text-specific modules and training only the image-specific modules, LlamaFusion preserves the language capabilities of text-only LLMs while developing strong visual understanding and generation abilities. Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that LlamaFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs, while maintaining Llama-3's language capabilities. We also demonstrate that this framework can equip existing vision-language models with multimodal generation ability. Overall, this framework not only leverages existing computational investments in text-only LLMs but also enables the parallel development of language and vision capabilities, presenting a promising direction for efficient multimodal model development.
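The modality routing described in the abstract can be sketched as a single transformer block in which each token is projected and fed forward by its own modality's parameters, while the attention operation itself is shared across all tokens. This is a minimal illustrative sketch, not the paper's implementation: class and parameter names are my own, and the modality-specific normalization layers, causal masking, and diffusion-side details are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ModalityRoutedBlock:
    """Sketch of a LlamaFusion-style block: modality-specific QKV and FFN
    weights, shared self-attention across text and image tokens.
    (Illustrative only; norms, masking, and multi-head logic omitted.)"""

    def __init__(self, d, rng):
        # Index 0 = text modules (would be frozen), 1 = image modules (trained).
        self.qkv = [rng.standard_normal((d, 3 * d)) * 0.02 for _ in range(2)]
        self.ffn = [rng.standard_normal((d, d)) * 0.02 for _ in range(2)]

    def __call__(self, x, modality):
        # x: (seq, d) token features; modality: (seq,) of 0 (text) or 1 (image)
        d = x.shape[1]
        q = np.empty_like(x); k = np.empty_like(x); v = np.empty_like(x)
        for m in (0, 1):  # route each token through its modality's QKV projection
            idx = modality == m
            if idx.any():
                q[idx], k[idx], v[idx] = np.split(x[idx] @ self.qkv[m], 3, axis=1)
        # Shared self-attention: text and image tokens attend to each other.
        h = x + softmax(q @ k.T / np.sqrt(d)) @ v
        # Modality-specific feedforward layers (ReLU stands in for the real MLP).
        out = np.empty_like(h)
        for m in (0, 1):
            idx = modality == m
            if idx.any():
                out[idx] = h[idx] + np.maximum(h[idx] @ self.ffn[m], 0.0)
        return out
```

Under this sketch, "freezing the text-specific modules" amounts to excluding `self.qkv[0]` and `self.ffn[0]` from the optimizer, so language behavior is untouched while the image-side parameters learn.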
