MLLM | Vision Representation in MLLMs

MinWoo(Daniel) Park | Tech Blog

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-08-27

Law of Vision Representation in MLLMs

  • url: https://arxiv.org/abs/2408.16357
  • pdf: https://arxiv.org/pdf/2408.16357
  • html: https://arxiv.org/html/2408.16357v1
  • abstract: We present the “Law of Vision Representation” in multimodal large language models (MLLMs). It reveals a strong correlation between the combination of cross-modal alignment, correspondence in vision representation, and MLLM performance. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated to model performance. By leveraging this relationship, we are able to identify and train the optimal vision representation only, which does not require finetuning the language model every time, resulting in a 99.7% reduction in computational cost.
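The selection idea in the abstract (fit the linear relationship on a few finetuned settings, then use it to rank the rest) can be illustrated with a toy sketch. This is not the paper's actual procedure: the AC scores, benchmark numbers, and names such as `rep_a` and `fit_line` below are hypothetical, made up for illustration, and the computation of the AC score itself is not shown.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y ≈ slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical AC scores for five candidate vision representations
# (made-up values, not taken from the paper).
ac_scores = {"rep_a": 0.2, "rep_b": 0.4, "rep_c": 0.5, "rep_d": 0.7, "rep_e": 0.9}

# Assume only three candidates were actually finetuned and benchmarked.
# These synthetic scores follow perf = 40 * ac + 30 exactly.
measured = {"rep_a": 38.0, "rep_b": 46.0, "rep_d": 58.0}

# Fit the line on the measured subset only.
slope, intercept = fit_line(
    [ac_scores[name] for name in measured],
    list(measured.values()),
)

# Predict performance for every candidate, including the untrained ones,
# and pick the representation with the highest predicted score.
predicted = {name: slope * ac + intercept for name, ac in ac_scores.items()}
best = max(predicted, key=predicted.get)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, best={best}")
```

In this toy setup only three of the five candidates need a full training run; the remaining two are ranked from their AC scores alone, which is the kind of saving the abstract attributes to the linear relationship.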
