Contents
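1. Introduction
Large language models (LLMs) have improved the ability of machines to understand and express human language. Progress in LLMs has also produced notable results in the vision-language domain, where image encoders are combined with LLMs to integrate their reasoning capabilities [1, 2, 3, 4]. Prior multimodal LLM research has focused either on text plus one other modality (e.g., text-and-image models) or on proprietary language models [2, 4]. To address these limitations, we introduce the Any-Modality Augmented Language Model (AnyMAL): a collection of multimodal encoders trained to map data from diverse modalities, including images, video, audio, and IMU motion sensor data, into the text embedding space of an LLM.
Key Contributions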
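2. Related Work
Large Language Models (LLMs)
LLMs of various sizes have recently emerged, showing improved reasoning capabilities. The best-known commercial service is ChatGPT [4, 7]; open-source models include FlanT5 [8], GPT-J [9], OPT [10], LLaMA [11], Vicuna [12], and LLaMA-2 [6]. This work aims to extend these strong text-based reasoning capabilities to diverse multimodal inputs.
Vision-Language Models
Many studies have developed models that integrate vision and language, with practical applications in image captioning [13] and visual question answering (VQA) [14, 15, 16]. The bottleneck in earlier work was the scarcity of data sources aligning different modalities, but recent work has shifted toward leveraging the capabilities of pre-trained LLMs. This work extends that approach by accepting diverse input modalities beyond visual signals, fine-tuning the model on manually collected multimodal instruction-tuning data, and presenting an efficient pre-training method that scales the LLM to 70B parameters.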
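3. Methods
3.1 Pre-training
Modality Alignment
To achieve multimodal understanding, we pre-train the LLM on paired data consisting of modality-specific signals and text narrations (see Figure 2). For each modality, we train a lightweight adapter that projects the input signal into the text token embedding space of a specific LLM. The LLM's text token embedding space thereby becomes a joint token embedding space whose tokens represent either text or another modality. The number of token embeddings representing each modality is fixed per adapter and ranges from 64 to 256 in this work. During alignment training, the parameters of the underlying LLM are frozen, which lets the model converge faster than end-to-end training from scratch and inherit the LLM's reasoning capabilities at inference time. In addition, to maximize feature compatibility, each modality uses an encoder already aligned to a text embedding space (e.g., CLIP [30, 31] for images, CLAP [32] for audio signals, or IMU2CLIP [33] for IMU signals). Each text caption and modality pair $(X_{text}, X_{modality})$ is aligned using the following objective with a projection module (e.g., a Perceiver Resampler [2] for the vision encoder and linear layers for other modalities), where $g(\cdot)$ is the modality encoder and $\text{proj}(\cdot)$ is the projection module: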
\[L_{alignment} = \sum_{i} \text{distance}\left(g(X_{text}^i), \text{proj}(g(X_{modality}^i))\right)\]
Datasets
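For image alignment, we use a cleaned subset of the LAION-2B dataset, filtered using the CAT method and with any detectable faces blurred [34]. For audio alignment, we use the AudioSet [35] (2.1M samples), AudioCaps [36] (46K samples), and CLOTHO [37] (5K samples) datasets. For IMU and text alignment, we use the Ego4D dataset [38] (528K).
Quantization
Pre-training a 70B-parameter model on a large dataset (200M+ instances) requires substantial resources and often needs an FSDP [39] wrapper to shard the model across multiple GPUs. By implementing 4-bit and 8-bit quantization strategies, we reduce the memory requirement by an order of magnitude and can train the 70B AnyMAL on a single 80GB VRAM GPU with a batch size of 4. Compared with FSDP, the proposed quantization approach achieves the same throughput while using only half of the GPU resources.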
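3.2 Fine-tuning with the Multimodal Instruction Dataset
To improve the model's instruction-following ability across diverse input modalities, we perform additional fine-tuning with the multimodal instruction-tuning (MM-IT) dataset. Specifically, we concatenate the input as [
Manual Annotation
Although various VQA datasets are publicly available, they lack the diversity and quality needed for multimodal instruction-following tasks. We collect 60K examples of high-quality multimodal instruction-tuning data, with manual annotation providing instruction-response pairs that are strictly grounded in each modality.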
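4. Experiments
4.1 Tasks
We evaluate the model on two categories of tasks: (1) captioning tasks for various modalities, and (2) multimodal reasoning and instruction-following tasks.
Captioning Tasks
We evaluate AnyMAL's primary capability of generating captions for input modalities. The main purpose of this task is to understand the level of alignment between text and other modalities after pre-training.
Multimodal Reasoning Tasks
Given the high level of modality alignment, we evaluate the model's reasoning and instruction-following abilities. We conduct a comprehensive comparison with strong baseline models for each modality pair (vision-language and audio-language) from the public literature.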
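4.2 Quantitative Analysis
Image Caption Generation
Table 2 shows zeroshot image captioning performance on COCO [48] and a subset of the MM-IT dataset. The AnyMAL variants outperform the baselines on both datasets.
Human Evaluation on Multimodal Reasoning Tasks
The MM-IT dataset features diverse pairs of multimodal instructions and ground-truth responses. We evaluate the model against the baselines on response accuracy, object recognition accuracy, and integrity, and we rely on human assessment because it provides the most precise insight.
4.3 Qualitative Analysis
Comparison with Other Vision-Language Models
AnyMAL shows strong object recognition and language generation capabilities.
AnyMAL effectively aligns data from diverse modalities and maps it into the text embedding space, yielding improved performance on multimodal reasoning and instruction-following tasks.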
Large Language Models (LLMs), known for their substantial size and complexity, have significantly enhanced the capacity of machines to understand and articulate human language. The progress in LLMs has also led to notable advancements in the vision-language domain [1, 2, 3, 4], bridging the gap between image encoders and LLMs to combine their reasoning capabilities. Prior multimodal LLM research has concentrated on models that combine text and one other modality [3, 5], such as text and image models, or has centered on proprietary language models that are not open sourced [2, 4]. To tackle the previously mentioned challenges, we introduce Any-Modality Augmented Language Model (AnyMAL) — a collection of multi-modal encoders trained to transform data from various modalities, including images, videos, audio, and IMU motion sensor data, into the text embedding space of an LLM. To achieve this, we extend the work by [1] to (1) more capable instruction-tuned LLMs (i.e. LLaMA-2-70B-chat [6]), (2) larger pre-trained modality encoders, and (3) advanced projection layers to handle variable input lengths. The model output examples are shown in Figure 1, and an illustration of the overall methodology is shown in Figure 2. The key contributions of the work are as follows:
Figure 1: Example AnyMAL outputs. The model understands various input signals (i.e. vision, audio, motion sensor signals), and responds to free-form user queries. When multiple modalities are interleaved and given as input (e.g. right-most: image + IMU motion sensor signals), the model reasons over them jointly.
Large Language Models (LLM): There has been a surge of LLMs with varying model sizes recently, showcasing remarkable reasoning capabilities. While the most well-known commercial service is ChatGPT [4, 7], the open-sourced models include FlanT5 [8], GPT-J [9], OPT [10], LLaMA [11], Vicuna [12], and more recently, LLaMA-2 [6]. Our work builds upon the powerful text-based reasoning capabilities of these LLMs, extending these capabilities to multimodal inputs.
Vision-Language Models: Numerous studies have addressed the task of instructing a unified model that integrates both visual and linguistic elements, finding practical implementations in domains like image captioning [13] and visual question answering (VQA) tasks [14, 15, 16]. While the relative scarcity of data sources aligning different modalities has conventionally been considered the bottleneck in scaling, recent works have shifted towards harnessing the capabilities of pre-trained LLMs, tapping into the knowledge accrued from extensive textual corpora. These works include Flamingo [2], OpenFlamingo [17], PaLM-E [18], BLIP-2 [3], InstructBLIP [19], LLaVA [20], IDEFICS [5], MiniGPT-4 [21] and many more [22, 23, 24, 25, 26, 27, 28], where each model uses different variants of base LLMs. These models typically undergo fine-tuning stages as well, re-purposing several task-specific vision-language datasets [20, 29]. Our work extends the previous approaches by (1) allowing for diverse input modalities beyond vision signals, (2) presenting a fine-tuning process with our manually collected multimodal instruction tuning data, and (3) scaling the LLM parameters to 70B via an efficient pre-training approach.
Figure 2: AnyMAL Training. (a) Modality alignment pre-training allows for mapping the output of each modality encoder into the joint LLM embedding space through projection layers. (b) With multimodal instruction tuning, the model learns to associate system instructions and text queries with input multimodal contexts. Our modality-specific encoder zoo includes: CLIP ViT-L, ViT-G, DinoV2 (image), CLAP (audio), IMU2CLIP (IMU motion sensor), and InternVideo (video).
Modality Alignment: We achieve the multimodal understanding capabilities by pre-training LLMs with paired multimodal data (modality-specific signals and text narrations) (Figure 2). Specifically, we train a lightweight adapter for each modality to project the input signals into the text token embedding space of a specific LLM. In this way, the text token embedding space of the LLM becomes a joint token embedding space, with tokens representing either text or other modalities. The number of token embeddings used to represent each input modality is fixed per adapter, ranging from 64 to 256 in this work. During the alignment training, we freeze the model parameters of the underlying LLM, which allows it to reach convergence faster than training end-to-end from scratch, and to inherit the reasoning capabilities of the LLM at inference time. In addition, to maximize the feature compatibility, for each modality we use an encoder $g(\cdot)$ that has already been aligned to a text embedding space, e.g. CLIP [30, 31] for images, CLAP [32] for audio signals, or IMU2CLIP [33] for IMU signals. For each text caption and modality pair $(X_{text}, X_{modality})$, we align them using the following objectives with a projection module (i.e. Perceiver Resampler [2] for vision encoder, and linear layers for other modalities).
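The adapter idea described above can be illustrated with a minimal sketch. It assumes a single linear layer that maps one encoder feature vector to a fixed number of token embeddings in the LLM's embedding size, and it assumes the modality tokens are placed before the text tokens. The function name `linear_adapter`, the toy dimensions, and the random stand-in inputs are all hypothetical and are not taken from the paper; the actual projection modules (e.g. the Perceiver Resampler) are more involved.

```python
import random


def linear_adapter(features, weight, bias, num_tokens, llm_dim):
    """Project one encoder feature vector into num_tokens embeddings of size llm_dim."""
    # One linear map that produces num_tokens * llm_dim outputs.
    flat = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]
    # Reshape the flat output into a fixed number of token embeddings.
    return [flat[t * llm_dim:(t + 1) * llm_dim] for t in range(num_tokens)]


random.seed(0)
ENC_DIM, LLM_DIM, NUM_TOKENS = 8, 4, 3  # toy sizes; the text reports 64 to 256 tokens

# Trainable adapter parameters (in this setup the LLM itself would stay frozen).
weight = [[random.gauss(0.0, 0.02) for _ in range(ENC_DIM)]
          for _ in range(NUM_TOKENS * LLM_DIM)]
bias = [0.0] * (NUM_TOKENS * LLM_DIM)

encoder_features = [random.random() for _ in range(ENC_DIM)]  # stand-in for g(X_modality)
text_embeddings = [[0.0] * LLM_DIM for _ in range(5)]         # stand-in for 5 text tokens

modality_tokens = linear_adapter(encoder_features, weight, bias, NUM_TOKENS, LLM_DIM)
joint_sequence = modality_tokens + text_embeddings  # assumed order: modality, then text
print(len(joint_sequence), len(joint_sequence[0]))
```

Because only `weight` and `bias` would receive gradients in such a setup, the number of trainable parameters is independent of the size of the frozen LLM apart from its embedding dimension.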
Datasets: For image alignment, we use a cleaned subset of the LAION-2B dataset, filtered using the CAT method and with any detectable faces blurred [34]. For audio alignment, we use AudioSet [35] (2.1M samples), AudioCaps [36] (46K samples), and CLOTHO [37] (5K samples) datasets. We use the Ego4D dataset [38] for IMU and text alignment (528K).
Quantization: Scaling the pre-training to 70B parameter models for a large dataset (200M+ instances) requires significant resources, often requiring an FSDP [39] wrapper to shard the model across multiple GPUs. To effectively scale our training, we implement the quantization strategies (4 bits and 8 bits) [40] in our multimodal settings, in which we keep the LLM component of our model frozen and only have the modality tokenizers trainable. This approach shrinks the memory requirement by an order of magnitude. Thus, we are able to train 70B AnyMAL on a single 80GB VRAM GPU with a batch size of 4. Compared with FSDP, we observe that the proposed quantization approach achieves the same throughput while only using half of the GPU resources. We do note that the training and validation losses were consistently higher than with FSDP training, but this did not impact the generation quality (at inference time, we use the original LLM at full precision to maximize the accuracy).
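As a rough back-of-the-envelope check on why quantization matters here, the sketch below computes the memory needed just to hold 70B frozen weights at different bit widths. It counts weights only; activations, the trainable adapter parameters, and optimizer state are ignored, so it is an illustration of the scale involved rather than a reproduction of the reported memory savings. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 70_000_000_000  # 70B-parameter LLM
GPU_MEMORY_GB = 80           # single 80GB VRAM GPU, as in the text

for bits in (16, 8, 4):
    needed = weight_memory_gb(NUM_PARAMS, bits)
    fits = "fits" if needed < GPU_MEMORY_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {needed:.0f} GB ({fits} in {GPU_MEMORY_GB} GB)")
```

At 16 bits the weights alone need 140 GB, which exceeds a single 80 GB device, while 8-bit (70 GB) and 4-bit (35 GB) weights fit, with 4-bit leaving the most headroom for activations and the adapters.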
Example 1. Creative Instructions (Image)
Instruction Prompt: Write a short story about the scene in this photo. Include some dialogue between the characters.
Response: The restaurant table stood empty but adorned with remnants of a meal. A seagull landed on its edge, spreading his wings in a display of dominance. With a raucous cry, he fiercely guarded the plate before him, which he intended to make his dinner. His piercing gaze fixed upon another seagull, wings poised for landing, his eyes locked on the feast. “Back off!” the first seagull screeched. Beak open wide, he emitted a symphony of squawks, challenging the attacker. Undeterred, the second seagull hesitated, torn between his desire for snacks, and fear of the defending seagull. The tension hung in the air, and as the first seagull prepared to yell again, the second replied with a sullen “Fine,” and flew off. Triumphant, the first gull tucked into his dinner.
Example 2. Detailed Description (Image)
Instruction Prompt: Explain all the details of the scene, starting from left to right.
Response: This is a photo of a small delivery truck. In the background of the photo is a loading dock of some type with green and yellow structures under an awning. The small truck stands at the center of the photo. It has a flatbed and awning over the flatbed and some type of white cabinet on the back and a black storage compartment under. The bed is green and has black and green milk crates stacked on it. The cab of the truck is small and tall with large windows all around. It is white with a green lower section and fine yellow stripe separating the green and white paint. The front of the truck has ‘Dairy Crest’ painted on it in green and has a European style license plate. There is a partial image of a man in the background.
Example 3. Open-ended Question and Answering (Video)
Instruction Prompt: What is this animal doing and why are they doing it?
Response: Pheasants eat gravel to help them in the digestion of their food. This helps them survive multiple weeks without feeding during extreme weather. A newly hatched chick will survive on insects for a few weeks of its life.
Example 4. Multimodal Reasoning (Audio)
Instruction Prompt: What could be the likely momentous occasion in this sound clip?
Response: This is a countdown to a rocket taking off into space by a robotic voice.
Table 1: Example data from our MM-IT dataset, manually collected instruction prompt and response pairs for diverse modalities (i.e. image, video, audio). The collected instructions cover diverse domains (e.g. creative writing, open-ended reasoning), and are strictly grounded to the provided multimodal context (i.e. queries cannot be answered without understanding the accompanying multimodal context). The MM-IT data serves as both a fine-tuning dataset as well as an evaluation benchmark for complex multimodal reasoning tasks.
To further improve the model’s instruction-following capability with respect to diverse input modalities, we perform additional fine-tuning with our multimodal instruction-tuning (MM-IT) dataset. Specifically, we concatenate the input as [
We evaluate the model’s performance on two categories of tasks in the zeroshot setting: (1) captioning tasks for various modalities, and (2) multimodal reasoning and instruction-following tasks.
Captioning Tasks. We evaluate AnyMAL’s primary capability of generating captions given input modalities, which is aligned with the pre-training objective. The main purpose of the captioning task is to understand the alignment level between the text and other modalities after pre-training. Since the captioning tasks typically don’t require secondary reasoning steps, we expect that LLM weights or parameter sizes have less influence on the task.
Multimodal Reasoning Tasks. Given the high level of alignment among the modalities, we evaluate the model’s reasoning and instruction-following abilities, which it inherits from the core instruction-tuned LLM as well as from the multimodal instruction-tuning process. We conduct a comprehensive comparison with strong baseline models for each respective modality pair (vision-language and audio-language) from the open-sourced literature.
Note: As the MM-IT datasets include some in-domain images from public benchmarks (e.g. COCO), we report results separately for the pre-trained models (without the further instruction tuning of Section 3.2) and the instruction-tuned models, so that the former denote a strict zeroshot setup. All multimodal-instruction-tuned AnyMAL models are marked with “MM-IT” in the following sections.
Image Caption Generation: Table 2 shows zeroshot image captioning performance on COCO [48] and a subset of the MM-IT dataset marked with the “detailed description” task (MM-IT-Cap). Our AnyMAL variants significantly outperform the baselines on both datasets. Notably, there is no significant gap between the performance of the AnyMAL-13B and the AnyMAL-70B variants. This result indicates that the underlying LLM capability has a smaller impact on the image caption generation task (which corresponds to the core visual understanding capability), and that the task depends largely on the scale of the data and the alignment methods. We attribute the slight under-performance of the AnyMAL-70B on COCO to the general verbosity of the LLaMA-70B model, which negatively impacts the score when evaluated against COCO captions that tend to be brief and concise. As expected, the automatic evaluation on MM-IT-Cap shows lower CIDEr scores overall, attributed to the much longer response length in detailed descriptions (see Table 1 for an example).
Table 2: Zeroshot Image Captioning performance on COCO and MM-IT-Cap. Ablations (bottom) over our AnyMAL with varying LLM sizes. Bold and underlined denote the top and the second-best performance, respectively. “-”: the model (a) does not report results on the marked benchmarks, or (b) is pretrained or fine-tuned on the respective dataset, thus not suitable for the zeroshot evaluation above. AnyMAL demonstrates the state-of-the-art zeroshot visual understanding capabilities compared to the baseline vision-language models.
Figure 3: Image-based reasoning human evaluation results on pairwise comparisons (% win, tie and lose) with baseline outputs against the manually annotated ground-truth samples from MM-IT (1K test set). Baselines used: BLIP-2 (FlanT5XXL) [3], InstructBLIP (Vicuna-13B) [19], MiniGPT4 [21] and LLaVA [20]. AnyMAL demonstrates a smaller gap with human-generated responses (41.1% win), compared to the baselines (LLaVA: 34.4% win, and MiniGPT4: 27.0%).
Human Evaluation on Multimodal Reasoning Tasks: MM-IT features diverse multimodal instruction and ground-truth answer pairs. We evaluate the performance of our models (pre-trained and instruction-tuned) against other vision-language models publicly available to run and use (i.e. LLaVA [20], MiniGPT4 [21]). Since the responses are subjective in nature (e.g. creative writing such as “Write a poem about this image”), we believe that human assessment provides the most precise insight into the performance and capabilities of our proposed model. We therefore collect pairwise comparisons for each baseline against 1K ground-truth samples (Figure 3), as well as the Likert scale scores (0-2) for each of the following criteria. The criteria for preference ranking include response accuracy, object recognition accuracy, and integrity (see the full rubrics in Appendix A). Response accuracy measures whether the response contains relevant, factually correct and verifiable information (without any hallucinations) with regard to the image and the instruction. Object recognition accuracy strictly measures whether the key objects are correctly recognized at a detailed level, primarily concerning the model’s visual knowledge. Finally, the integrity metric measures whether the response shows any harmful or offensive language. Figure 3 shows that AnyMAL achieves strong performance with a narrower gap against the manually annotated ground-truth samples (41.1% win), compared to the baselines (LLaVA: 34.4% win, and MiniGPT4: 27.0% win).
Notably, the model fine-tuned with the full instruction set exhibits the highest rate of preferential wins, showing a competitive level of visual understanding and reasoning capabilities comparable to human-annotated responses. It is also worthwhile to note that BLIP-2 and InstructBLIP suffer on these open-ended queries (4.1% and 16.7% preferential win, respectively), despite their strong performance in the public VQA benchmarks (Table 4).
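The pairwise protocol above reduces to tallying, for each model, how often annotators preferred its response over the ground-truth sample. The sketch below shows one way such win, tie and lose percentages could be computed. The function name `preference_rates` and the ten judgments are hypothetical illustration data, not results from the evaluation.

```python
from collections import Counter


def preference_rates(judgments):
    """Return the percentage of 'win', 'tie' and 'lose' labels in a list of judgments."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    counts = Counter(judgments)
    total = len(judgments)
    return {label: 100.0 * counts[label] / total for label in ("win", "tie", "lose")}


# Hypothetical annotator decisions for one model versus the ground-truth responses.
judgments = ["win", "lose", "tie", "win", "lose", "lose", "win", "tie", "lose", "win"]
rates = preference_rates(judgments)
print(rates)
```

With 1K test samples per model, a reported figure such as "41.1% win" would correspond to the `win` entry of this dictionary.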
AnyMAL 70B | AnyMAL 70B (MM-IT Synth Only) | AnyMAL 70B (MM-IT Human+Synth) |
---|---|---|
43.3 | 46.3 | 42.7 |
56.0 | 54.2 | 58.0 |
73.5 | 73.2 | 73.0 |
82.4 | 83.5 | 79.3 |
99.3 | 98.3 | 99.5 |
99.3 | 99.5 | 99.7 |
Table 3: Image-based Reasoning human evaluation results on 1K test set from MM-IT on different axes: (a) Response Accuracy and Relevance (%) – whether responses are relevant to instructions and factually correct without any hallucinations, (b) Object Recognition (%) – whether key objects are identified at a detailed level, and (c) Integrity (%) – whether responses include offensive language. MM-IT indicates the model that has been instruction-tuned either with synthetic data only, or with the manually collected set (Section 3.2).
Table 4: Zeroshot Image-based QA results on 6 different VQA datasets (H-Meme: Hateful Meme, S-QA: Science QA). Ablations (bottom) over AnyMAL with varying base ViTs and LLM sizes. MM-IT (last row) denotes the model fine-tuned on our instruction dataset. Bold and underlined denote the top and the second-best performance, respectively. AnyMAL demonstrates competitive zeroshot multimodal reasoning capabilities, compared to the baseline vision-language models. *: Results with additional OCR inputs. †: in-domain images (i.e. COCO, TextCap) have been used during training, thus not a strict zeroshot performance.
Table 5: Zeroshot Audio Captioning results on AudioCaps. Ablations (bottom) over our AnyMAL with varying base LLMs and sizes. AnyMAL attains the best performance across multiple metrics, showing the model’s strong performance in audio signal understanding.
Table 6: Zeroshot Video-based QA accuracy on STAR, How2QA, and NextQA. Ablations (bottom) over AnyMAL with image vs video model and LLM sizes. AnyMAL demonstrates competitive zeroshot multimodal reasoning capabilities, compared to the baseline vision-language models. Numbers in brackets denote number of frames used per video clip.
Table 3 shows the detailed breakdown of scores for each criterion. Specifically, the model instruction-tuned with both manually collected and synthetically curated sets attains the highest response accuracy and relevancy score (12.2% relative improvement compared to the strongest baseline: LLaVA). This result highlights the enhanced capability of the model to comprehend and precisely answer questions in accordance with provided instructions. It is worth mentioning that the model, after instruction tuning, exhibits a decline in its ability to provide detailed recognition and descriptions of objects. We attribute this result to our intention behind collecting the AnyMAL dataset which promotes concise responses. Consequently, the model tends to provide more succinct descriptions, rather than elaborating on details (which often increases the risk of factual inaccuracies). High integrity scores (99+%) are achieved by all baselines.
VQA Benchmarks: Table 4 shows the zeroshot performance on the Hateful Meme dataset [49], VQAv2 [14], TextVQA [50], ScienceQA [51] (image-paired subset), VizWiz [52], and OKVQA [53] compared against the models in the literature that report zeroshot results on the respective benchmark. We focus on zeroshot evaluation to best estimate the model’s performance on the open-ended queries at inference time. Overall, our AnyMAL exhibits the strongest performance compared to the baselines across multiple tasks. Notably, the pretrained AnyMAL models show competitive performance even without further fine-tuning on MM-IT, demonstrating strong reasoning capabilities retained in the base pre-training phase. Comparing the two text-guided vision encoders ViT-L [30] and ViT-G [31], ViT-G achieves higher scores in most of the vision datasets. These results suggest that larger encoders trained over more steps offer better representations.
However, the TextVQA dataset presents a unique case in which the image encoder’s resolution plays a pivotal role in recognizing text within images. Consequently, the ViT-L model, with a resolution of 336x336, achieves a higher score than the ViT-G model, which has a resolution of 224x224. DinoV2 [54], which is trained with a discriminative self-supervised approach, performs worse than the text-guided vision encoders, demonstrating the importance of feature space alignment. Among the base LLM models, our 70B model demonstrates the most robust performance, underscoring the influence of the substantial reasoning proficiency inherent in larger LLMs on tasks involving visual reasoning.
Video QA benchmarks: We evaluate our model on three challenging video question-answering benchmarks in Table 6: How2QA [55], STAR [56] and NextQA [57]. Our model demonstrates competitive results compared to the baselines, and achieves state-of-the-art performance on the STAR benchmark. Note that we compare against approaches that process the full, untrimmed video clip to generate answers. Prior work has shown additional improvements with careful frame-selection strategies [58]. Our approach is compatible with such strategies; however, that is beyond the scope of our experiments. We additionally report model variants trained exclusively on videos from HowTo100M [59] coupled with text from ASR transcripts, and using explicit video encoders (InternVideo [46]) as opposed to image encoders. However, these models perform worse due to the weak alignment of ASR to video clips and lower diversity in content (i.e., instructional videos).
Ablations on Hyperparameters: Figure 4 shows the training losses on the variants of AnyMAL during image-text pre-training. Due to the high computational cost of training 70B models, we conduct ablations only on the 13B models. We then use the optimal set of hyperparameters from these experiments as the default configuration for training the 70B models. Since the trainable parameters remain consistent for both the 13B and 70B models, we anticipate that the behavior during model training will remain unchanged. The following key hyperparameters are considered: the layers of the Resampler (2 vs. 6), the number of multimodal token embeddings to represent visual input (64 vs. 256 tokens), and batch size of pre-training (2,048 vs. 16,384). Overall, we observed that increasing the batch size and the number of visual tokens yields minimal improvement. In contrast, increasing the number of resampling layers significantly reduces the loss without substantially increasing the training budget.
Figure 4: Training losses on variants (13B and 70B) of AnyMAL during image-text pre-training. Default parameters: batch size: 2,048, resampler layers: 2, the number of modality tokens: 64. Each line features an ablation with one differing parameter from the default setting, i.e. bsz16K: 16,384 batch size, emb256: 256 visual tokens, depth6: 6 resampler layers.
Effects of Scaling LLM Parameter Size (70B vs. 13B): The 70B model demonstrates a reduced training loss overall when compared with the 13B versions. This loss performance also aligns with the downstream task results in Tables 2 and 4. We attribute this result to the inherent reasoning abilities and the knowledge assimilated within the 70B models, which expedites the visual concept acquisition and alignment process. Overall, the comparison demonstrates the importance of scaling LLM parameters in vision-language pre-training as well, which is an aspect that has seldom been addressed in existing literature.
Audio Caption Generation: Table 5 shows the audio captioning results on the AudioCaps [36] benchmark dataset. AnyMAL significantly outperforms other state-of-the-art audio captioning models in the literature (e.g. +10.9pp in CIDEr, +5.8pp in SPICE), showing the versatility of the proposed approach on various modalities beyond just vision. We note that our 70B model displays notably strong performance compared to the 7B and the 13B variants, showing the importance of the reasoning module for the task.
Gold Caption: Typing on a keyboard is occurring in a quiet environment
Prompt: Can you guess my location? Reply with one or two sentences.
Response: Based on the sound of the typing, I would guess that you are located in a quiet, indoor setting, possibly in an office or study. The sound of the typing is clear and crisp, with a slight echo or reverb that suggests a room with a high ceiling or a large, open space.
Gold Caption: Very strong wind is blowing, and waves are rustling
Prompt: Based on the weather, what should I wear? Reply with one or two sentences.
Response: Based on the sound of the wind and the waves, it seems like a stormy weather is approaching. It’s best to dress in layers and wear something waterproof and warm to stay comfortable and safe.
Gold Caption: A baby cries while electronic speech occurs in the background, an adult female speaks, a sneeze occurs, and an adult female speaks once more.
Prompt: Describe the sound in one or two sentences.
Response: The sound is a loud, high-pitched scream of a child, with a hint of a whine, as if the child is crying and screaming at the same time. The sound is accompanied by the muffled voices of adults in the background, possibly the child’s parents or caregivers, speaking in a soothing tone.
Table 7: Example outputs from AnyMAL on unseen audio samples. Gold captions are shown as a reference for the audio signals. The text prompt and the audio signals are used as input.
Prompt: Describe the motion.
Response: [Camera wearer] looks around.
Prompt: Describe the motion.
Response: [Camera wearer] walks in the house.
Prompt: What’s the safest way to stop?
Response: To stop safely on a bike, use the brakes gradually and look where you want to go. Lean the bike slightly to one side and keep your body upright.
Table 8: Example outputs from AnyMAL, on IMU motion sensor samples from Ego4D. The text prompt and the IMU signals are used as input. We display a few frames from the synchronized video as a reference for readers. The reference video is not used as input to the model.
IMU Motion Description Generation: We use the Ego4D [38] dataset to train an IMU-aligned AnyMAL-7B model, leveraging the synchronized IMU sensor data and textual narrations provided in the dataset.
Given that the task of generating textual descriptions from motion signals has not been previously achievable or reported, we solely present the performance achieved by our own model. On the held-out test set, we achieve 52.5 CIDEr and 23.2 ROUGE-L against the ground-truth captions, showing the feasibility of the newly proposed task. Combining this captioning ability with the reasoning capabilities of LLMs, in Table 8 we show examples of novel applications that AnyMAL might allow, e.g. inferring user motion states and incorporating these as part of its response (e.g. “What’s the safest way to stop?”→“To stop safely on a bike, …” without any textual or visual cues that the user is biking).
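For readers unfamiliar with the ROUGE-L metric reported above, the following is a minimal sketch of a sentence-level variant based on the longest common subsequence (LCS) of word tokens, combined here as a plain F1 score. The exact scoring toolkit, tokenization, and recall weighting used for the reported numbers are not stated in the text, so this should be read as an illustration of the idea only. The two example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equal tokens, otherwise carry the best so far.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Sentence-level ROUGE-L as an F1 over LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical generated caption and ground-truth narration.
score = rouge_l_f1("camera wearer walks in the house",
                   "camera wearer walks around the house")
print(round(score, 4))
```

Here five of the six tokens on each side form the common subsequence, so precision and recall are both 5/6.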
Comparison with other Vision-Language Models: Tables 9 and 10 show outputs from various vision-language models [17, 19, 20, 21] on diverse example image and prompt pairs, compared with AnyMAL (LLaMA-2-70B). AnyMAL exhibits strong visual understanding capabilities (such as identification of objects and their states), as well as language generation capabilities. While MiniGPT4 [21] and LLaVA [20] do present reasonable and fluent responses, their accuracy is not consistently ensured. These examples highlight the benefits of the proposed approach, which allows for large-scale pre-training covering diverse visual concepts while inheriting strong reasoning capabilities derived from instruction-tuned LLMs. We note that we use the latest checkpoints made available for each baseline to generate responses.
Interleaved Modalities: The flexible model architecture of AnyMAL allows for combinatory modalities as conditioning context (e.g. image + IMU motion sensor signals), which allows for more comprehensive multimodal reasoning. We demonstrate the model’s zeroshot capabilities of handling such interleaved modalities in Table 11 (e.g. composing a message with a given image (Golden Gate Bridge), with the user’s prevalent motion (biking) as part of the context). This result illustrates the new and natural way of interaction with an AI model made possible by AnyMAL, wherein a user can presume a shared understanding of combined sensory perceptions (e.g. visual, auditory, and motion cues) when composing queries, avoiding the need to specify multimodal contexts.
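One way to picture interleaved conditioning is as a single token sequence in which each modality contributes a fixed-size block of projected tokens, followed by the text query. The sketch below only tracks positions in such a sequence. The per-modality token counts, the segment order, the file names, and the function name `assemble_context` are hypothetical choices for illustration; the text states only that each adapter emits a fixed number of tokens in the 64 to 256 range.

```python
def assemble_context(segments, tokens_per_modality):
    """Lay out modality blocks and text tokens in one sequence of (kind, item) slots."""
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            # Whitespace split stands in for the LLM tokenizer.
            sequence.extend(("text", word) for word in payload.split())
        else:
            # Each non-text input occupies a fixed number of token slots.
            sequence.extend((kind, i) for i in range(tokens_per_modality[kind]))
    return sequence


tokens_per_modality = {"image": 256, "imu": 64}  # assumed sizes within the 64-256 range
segments = [
    ("image", "golden_gate.jpg"),   # hypothetical image input
    ("imu", "bike_ride.npy"),       # hypothetical IMU recording
    ("text", "Write a message to my friend"),
]
context = assemble_context(segments, tokens_per_modality)
print(len(context))
```

Since every block lives in the same embedding space, adding another modality changes only the length of the conditioning sequence and not the LLM itself.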
Table 9: Examples comparing AnyMAL with various baselines: BLIP-2 (FlanT5_XXL) [3], Instruct-BLIP (Vicuna-13B) [19], OpenFlamingo-v2-9B [17], Mini-GPT4 [21], and LLaVA [20]. AnyMAL demonstrates competitive performance in object recognition (e.g. “nutcracker”), response relevancy (e.g. writing a joke, explaining how-to), and secondary reasoning.
The wine bottles in the image are two different types of wine, one is a red wine and the other is a white wine. Both of these wines would pair well with steak. The red wine would be a good choice for a rich and flavorful steak, while the white wine would be a good choice for a lighter steak or seafood dish.
For a steak, the red wine is likely a more appropriate pairing compared to the white wine. Whether it is Chianti or Chianti Classico, both options are Italian red wines made from Sangiovese grapes, and their full-bodied nature provides a rich flavor profile, making them good companions to steak dishes. The white wine, Soave Classico DOC, is a white grape varietal, which is less traditional for pairing (…)
Table 10: Examples comparing AnyMAL with various baselines. AnyMAL demonstrates competitive performance in both object recognition (e.g. “Barolo”, “Arno River”) and secondary reasoning.
Table 11: Example outputs from AnyMAL, with multiple interleaved modalities as input. The text prompt and two other modalities (e.g. image & IMU motion sensor signals) are used as input. The underlined text in the response demonstrates the evidence that the output is grounded on multiple modalities.
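The interleaved input described for Table 11 can be illustrated with a minimal sketch. It assumes that each modality's encoder features are projected into the LLM token-embedding space by a per-modality adapter and that the projected tokens are placed before the text tokens; the ordering, the dimensions, the random weights, and all names (`make_weight`, `project`, `build_context`, `EMBED_DIM`) are hypothetical and are not taken from the paper's implementation.

```python
import random
from typing import Dict, List, Tuple

Vector = List[float]
EMBED_DIM = 8  # hypothetical; stands in for the LLM token-embedding width


def make_weight(in_dim: int, out_dim: int, rng: random.Random) -> List[Vector]:
    """Random (out_dim x in_dim) matrix standing in for a trained adapter."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(features: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Map each encoder feature vector into the LLM embedding space (W @ x)."""
    return [[sum(w * x for w, x in zip(row, feat)) for row in weight]
            for feat in features]


def build_context(
    text_embeds: List[Vector],
    modality_feats: Dict[str, List[Vector]],
    adapters: Dict[str, List[Vector]],
) -> Tuple[List[Vector], List[Tuple[str, int]]]:
    """Concatenate projected modality tokens, then the text tokens (assumed order)."""
    sequence: List[Vector] = []
    layout: List[Tuple[str, int]] = []
    for name, feats in modality_feats.items():
        tokens = project(feats, adapters[name])
        sequence.extend(tokens)
        layout.append((name, len(tokens)))
    sequence.extend(text_embeds)
    layout.append(("text", len(text_embeds)))
    return sequence, layout


rng = random.Random(0)
# Hypothetical encoder outputs: 4 image tokens of width 6, 2 IMU tokens of width 3.
feats = {
    "image": [[rng.random() for _ in range(6)] for _ in range(4)],
    "imu": [[rng.random() for _ in range(3)] for _ in range(2)],
}
adapters = {
    "image": make_weight(6, EMBED_DIM, rng),
    "imu": make_weight(3, EMBED_DIM, rng),
}
prompt = [[rng.random() for _ in range(EMBED_DIM)] for _ in range(5)]  # 5 text tokens

sequence, layout = build_context(prompt, feats, adapters)
print(layout)  # [('image', 4), ('imu', 2), ('text', 5)]
```

Because every segment ends up in the same embedding space, adding another modality in this sketch only requires another entry in `feats` and `adapters`; the language model itself sees a single sequence of embeddings.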
Inference Time Integrity. To ensure the safety and integrity of the AnyMAL model, we apply several measures that cover the following categories of potential integrity violations: (1) input images, (2) input text prompts, (3) text outputs, and (4) the multimodal combination of input images and text outputs.
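One way to picture how the four categories could be enforced around a generation call is sketched below. This is an illustrative assumption only: the text does not specify the checks used, so each check is a caller-supplied placeholder, and the names `IntegrityChecks`, `guarded_generate`, and `WITHHELD` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

WITHHELD = "[response withheld]"  # hypothetical placeholder message


@dataclass
class IntegrityChecks:
    """Each callable returns True when its input is flagged as a violation."""
    image: Callable[[bytes], bool]                    # (1) input image
    prompt: Callable[[str], bool]                     # (2) input text prompt
    output: Callable[[str], bool]                     # (3) text output
    image_and_output: Callable[[bytes, str], bool]    # (4) image + output together


def guarded_generate(
    image: bytes,
    prompt: str,
    generate: Callable[[bytes, str], str],
    checks: IntegrityChecks,
) -> str:
    """Run input checks, generate, then run output checks."""
    if checks.image(image) or checks.prompt(prompt):
        return WITHHELD  # stop before the model is called
    response = generate(image, prompt)
    if checks.output(response) or checks.image_and_output(image, response):
        return WITHHELD
    return response


# Toy stand-ins: only the prompt check can fire in this example.
checks = IntegrityChecks(
    image=lambda img: False,
    prompt=lambda text: "blocked-term" in text,
    output=lambda text: False,
    image_and_output=lambda img, text: False,
)


def echo_model(image: bytes, prompt: str) -> str:
    """Stand-in for the multimodal model."""
    return "response to: " + prompt


print(guarded_generate(b"", "describe the photo", echo_model, checks))
print(guarded_generate(b"", "blocked-term", echo_model, checks))
```

Checking the inputs before generation avoids spending compute on requests that would be withheld anyway, while the fourth check exists because an image and a response can each be acceptable alone yet problematic in combination.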
Training Time Safety. The datasets used for pre-training (e.g. [34, 62]) have gone through a filtering process that removes harmful language or images that compromise integrity, thereby reducing the potential for the model to generate content that violates integrity standards.
LLM Safety. Since AnyMAL pre-training does not alter the parameters of the base LLM, we carry over the same safety precautions implemented for its language generation. For instance, LLaMA-2 (the version on which we report most of our results) includes safeguards such as negative-example fine-tuning and reinforcement learning from human feedback (RLHF) [63, 64, 65].
Our proposed AnyMAL showcases a novel and natural way of interacting with an AI model, e.g. asking questions that presume a shared understanding of the world between the user and the agent, through the same lens and combined perceptions (e.g. visual, auditory, and motion cues). The proposed scalable way of training AnyMAL makes it possible to leverage the powerful reasoning capabilities of the LLaMA-2 language model in multimodal settings. Our contributions are as follows: (1) We present a large-scale Multimodal LLM (AnyMAL), trained using open-sourced resources and scalable solutions for multiple modalities. (2) We introduce the Multimodal Instruction Tuning dataset (MM-IT), a first-of-its-kind collection of high-quality manual annotations of multimodal instruction data. (3) Our comprehensive empirical analysis provides insights into an efficient and scalable recipe for building a multimodal reasoning model, given various LLMs and modeling choices.
We discuss the current limitations of our work as follows. First, the proposed causal multimodal language modeling approach still struggles to ground its output robustly in the input modality. Specifically, we observe that during generation the model occasionally attends more to the text it has already generated than to the input image. The output then incorporates biases acquired from the underlying LLM, which can introduce inaccuracies relative to the image context. We expect that additional architectural adjustments or unfreezing the LLM parameters will be necessary to address this limitation effectively, albeit at potentially much higher computational cost. Second, while we greatly increase the size of the pre-training dataset, the understanding of visual concepts and entities remains constrained by the quantity of paired image-text data included in training. For text-only language models, it is commonly observed that incorporating external knowledge retrieval significantly enhances a model’s ability to overcome its knowledge limitations; such approaches offer a potential means to alleviate this constraint. Lastly, within the scope of our work, the multimodal adaptation of an LLM is limited to four modalities: image, video, audio, and IMU signals. While we believe that the proposed approach has the potential to encompass any other modality, provided a paired dataset exists, its effectiveness for such modalities still needs to be substantiated.