
Model | Apple - Ferret

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-10-23

Ferret: Refer and Ground Anything Anywhere at Any Granularity

  • url: https://arxiv.org/abs/2310.07704
  • pdf: https://arxiv.org/pdf/2310.07704
  • abstract: We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination. Code and data will be available at this https URL


TL;DR


  • Ferret is a multimodal large language model (MLLM) for unified referring and grounding tasks.
  • The model effectively handles image regions of complex shapes and performs open-vocabulary referring and grounding.
  • It is trained on the GRIT dataset and evaluated with Ferret-Bench.

1. Introduction

This paper treats improving spatial understanding as a key research problem in vision-language learning. In particular, referring and grounding are two core capabilities that depend on aligning spatial information with semantics. Referring requires the model to accurately understand the semantics of a specified region, while grounding involves localizing a region based on a given semantic description. Existing work, however, tends to handle these two tasks separately. In contrast, the model proposed in this paper, Ferret, combines referring and grounding into a single unified framework and shows that the two tasks can mutually benefit each other.

Problem Definition

Existing models treat referring and grounding as separate tasks and use specialized methods optimized for each. This approach often limits knowledge transfer between the tasks and prevents a model from effectively applying what it learns in one task to the other.

Approach

Ferret is a new multimodal large language model (MLLM) that can efficiently handle regions of diverse forms (points, boxes, scribbles, and free-form shapes) by expressing them in natural-language numerical form. A spatial-aware visual sampler extracts visual features from a region of any shape, and these continuous visual features are combined with the discrete coordinates to form a hybrid region representation.

Differences from Prior Work

Compared to previous models, Ferret is the first attempt to unify referring and grounding, demonstrating the ability to process textual and visual information jointly. This allows the model to operate flexibly in a wider range of real-world applications.


2. Method

Hybrid Region Representation

To handle diverse region formats such as points, boxes, and free-form shapes, Ferret uses a hybrid region representation that combines discrete coordinates with continuous visual features to refer to a specific region. This approach lets the model handle all of these formats effectively.

\[\text{Hybrid representation} = f(\text{discrete coordinates}, \text{continuous visual features})\]

$f$ is the function that fuses the discrete coordinates and the continuous visual features used to refer to a specific region.
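
A minimal sketch of what such an $f$ could look like in code. The coordinate quantization granularity, placeholder token, and helper names below are illustrative assumptions, not details taken from the paper:

```python
import torch

NUM_BINS = 1000  # assumed coordinate quantization granularity (illustrative)

def box_to_coord_text(box, img_w, img_h, num_bins=NUM_BINS):
    """Render a pixel-space box (x1, y1, x2, y2) as discrete coordinate text."""
    x1, y1, x2, y2 = box
    q = lambda v, size: int(round(v / size * (num_bins - 1)))
    return f"[{q(x1, img_w)}, {q(y1, img_h)}, {q(x2, img_w)}, {q(y2, img_h)}]"

def hybrid_region_representation(box, img_w, img_h, region_feature: torch.Tensor):
    """Sketch of f(discrete coordinates, continuous visual features).

    The coordinate text enters the LLM prompt as ordinary tokens; the continuous
    feature (from the spatial-aware visual sampler) is injected at the position
    of a placeholder token.
    """
    coord_text = box_to_coord_text(box, img_w, img_h)
    prompt_fragment = f"{coord_text} <region_feature>"
    return prompt_fragment, region_feature

# Example: a box in a 640x480 image paired with a 1024-dim region feature.
fragment, feat = hybrid_region_representation(
    (50, 40, 150, 190), 640, 480, torch.randn(1024)
)
print(fragment)  # "[78, 83, 234, 395] <region_feature>"
```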

Model Architecture

Ferret's architecture comprises an image encoder, a spatial-aware visual sampler, and an LLM that jointly model image, text, and region features. It processes an input image, text, and referred regions to produce grounded text responses.

Spatial-aware Visual Sampler

The sampler is designed to handle irregularly shaped regions and extracts continuous regional features. The process involves sampling, gathering, and pooling steps, allowing the model to work with regions of diverse shapes.

Output

In Ferret's output, grounding is achieved by generating box coordinates in the text response immediately after the corresponding region/noun. This lets the model express what in the image is groundable and where those objects are located.


3. GRIT: Instruction-Tuning Dataset

The GRIT dataset contains roughly 1.1M multimodal dialogues for model training. It comprises public datasets converted into an instruction-following format, instruction-tuning data generated with ChatGPT and GPT-4, and spatially mined negative data that strengthens model robustness.

Data Hierarchy

Spatial understanding is characterized at several levels of granularity: individual objects, relationships among objects, descriptions of specific regions, and region-based complex reasoning. The dataset is constructed with these dimensions in mind and organized into different task formats.

Experiments

Ferret's training details include an image encoder initialized from CLIP-ViT-L/14@336p and an LLM initialized from Vicuna. Ferret is trained on the GRIT data for three epochs.


[Reference 1] Grounding in NLP

In natural language processing (NLP), 'grounding' means connecting text to data or non-textual modalities, thereby facilitating interaction between language technology and the real world.

The NLP community uses the term broadly to refer to any case of linking text to data, whereas cognitive science defines grounding more formally as the process of establishing the mutual information required for successful communication between two interlocutors. This definition differs from the NLP usage in intent and scope, but it implicitly covers how the term is used in NLP.

The core idea of grounding is understanding how linguistic data connects to the real world; this connection helps technologies such as computers and robots interpret information the way humans do and respond appropriately to the situation at hand.

Mathematical and algorithmic approaches to grounding mainly focus on developing methods for jointly processing textual and non-textual data. For example, specific algorithms can be used to identify objects or locations mentioned in text and map them to real-world geographic information system (GIS) data.

Example: Conversation with a Chatbot

Suppose a user asks a chatbot, "Where is the nearest cafe?" Grounding applies when the chatbot uses the user's location data to access real geographic information and returns the location of the nearest cafe. That is, the system does not merely process the text; it understands how that text connects to a concrete real-world context and returns a useful answer based on that connection.

Research Trends and Directions

Recent NLP conferences have featured work applying grounding across diverse domains and tasks. Examples include systems that link a patient's natural-language statements to medical image data to support diagnosis, and in-car navigation systems that combine a user's textual destination commands with real-time traffic data to provide optimal routes.

In this way, grounding enables NLP technology to solve real-world problems, process information in a more human-like manner, improve the quality of communication between machines and humans, and provide more accurate and useful information.

Grounding reference: https://arxiv.org/abs/2106.02192


1. Introduction

In the field of vision-language learning, enabling spatial understanding in models is a fundamental research challenge. Two key capabilities arising from this problem are referring and grounding. Referring requires a model to accurately understand the semantics of specific regions, while grounding involves the model localizing regions based on given semantic descriptions. Despite these two capabilities relying on the alignment of spatial information and semantics, existing work often treats referring and grounding as separate tasks. In contrast, humans can learn from one task and apply shared knowledge to the other task seamlessly, integrating referring and grounding into daily dialogue and reasoning.

This paper addresses three main questions:

  1. How can referring and grounding be unified into a single framework, and will they mutually benefit each other?
  2. How can versatile types of regions, such as points, boxes, scribbles, and free-form shapes, commonly used by humans for referring, be effectively represented?
  3. How can referring and grounding be made open-vocabulary, instruction-following, and robust for practical applications?

To tackle these questions, this paper introduces Ferret, a novel Multimodal Large Language Model (MLLM) for referring and grounding. Ferret leverages the powerful vision-language global understanding capability of MLLMs to unify referring and grounding. It represents region coordinates in natural language numerical form, ensuring efficiency. However, representing various region shapes like strokes, scribbles, and complex polygons with single point or box coordinates is inefficient. To address this issue, a spatial-aware visual sampler is introduced to acquire visual features for regions in any shape, taking into account the varying sparsity of these shapes. These discrete coordinates and continuous visual features are combined to create a hybrid region representation in Ferret. This enables Ferret to handle inputs that mix referred regions with free-form text and to ground the mentioned objects in its output by generating coordinates for each groundable object along with text. Notably, Ferret is the first model capable of processing free-formed region inputs in MLLMs.

To make Ferret’s refer-and-ground capability open-vocabulary, instruction-following, and robust, the authors introduce the Ground-and-Refer Instruction-Tuning dataset (GRIT), containing 1.1 million samples. GRIT covers multiple levels of spatial knowledge, including objects, relationships, region descriptions, and complex reasoning. It consists of both location-based grounding and text-based referring data, as well as data that combines location and text in both input and output. Much of the dataset is derived from existing vision(-language) tasks but with templates designed to make it instruction-following. Additionally, 34,000 refer-and-ground instruction-tuning conversations are collected with the assistance of ChatGPT/GPT-4 to facilitate training an instruction-following and open-vocabulary refer-and-ground model. The authors also perform spatial-aware negative data mining to enhance model robustness.

Ferret possesses strong open-vocabulary spatial understanding and localization capabilities. When evaluated on traditional referring and grounding tasks, it exhibits superior performance. Beyond these tasks, Ferret is designed to integrate refer-and-ground capabilities into everyday human conversations. The paper introduces Ferret-Bench, covering three new types of tasks: Referring Description, Referring Reasoning, and Grounding in Conversation. In benchmarking against existing MLLMs, Ferret outperforms the best of them by an average of 20.4%. Furthermore, Ferret demonstrates the intriguing ability to reduce object hallucinations.

Note: There is no need for additional vocabulary or position encoders in the Ferret model.

In summary, our contributions are threefold:

  1. We propose Ferret, which utilizes a hybrid region representation equipped with a novel spatial-aware visual sampler to enable fine-grained and open-vocabulary referring and grounding in MLLM.
  2. We construct GRIT, a large-scale ground-and-refer instruction-tuning dataset, for model training. It also contains additional spatial negative samples to enhance model robustness.
  3. We introduce Ferret-Bench, which evaluates tasks jointly requiring referring/grounding, semantics, knowledge, and reasoning. Our model exhibits superior performance in a wide range of tasks and reduces object hallucination.

2.1 Multimodal Large Language Models (MLLMs)

Large Language Models (LLMs), including GPTs, PaLM, BLOOM, and LLaMA, have advanced research in NLP and paved the way for multimodal language models. Building on these LLMs, MLLMs are trained on image-text data and have been applied to a wide range of multimodal tasks.

2.2 MLLMs for Referring and Grounding

In the existing literature, works such as Kosmos-2, Shikra, GPT4ROI, PVIT, BuboGPT, VisionLLM, and ContextDET enable MLLMs for fine-grained image comprehension and open-world referring and grounding. While these works support bounding boxes, Ferret, with its innovative hybrid region representation, can handle a broader range of free-form shapes for referring, including points, boxes, sketches, scribbles, polygons, and more. It also introduces a comprehensive refer-and-ground instruction-tuning dataset and Ferret-Bench for evaluation.

2.3 Unifying Grounding and VL Understanding

Ferret unifies text and bounding box output for vision-language models, building upon LLMs. It handles bounding box coordinates as regular text tokens, eliminating the need for extra specialized tokens dedicated to representing boxes.

3. Method

3.1 Hybrid Region Representation

To refer to specific regions, three primary formats are generally used: point, box, and free-form shapes. Ferret proposes a hybrid region representation that combines discrete coordinates with continuous visual features to refer to a particular region. This allows it to handle all three formats effectively.

3.2 Model Architecture

Ferret’s model architecture consists of an image encoder, spatial-aware visual sampler, and an LLM to jointly model image, text, and region features. It processes input images, text, and referred regions to produce grounded text responses.
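
The following is a rough sketch of how these three components could be wired together. The component interfaces and the extra_embeddings argument are hypothetical; they only illustrate the flow of image, region, and text features described above, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FerretLikeModel(nn.Module):
    """Minimal sketch of the three-part architecture (hypothetical interfaces)."""

    def __init__(self, image_encoder, visual_sampler, projector, llm):
        super().__init__()
        self.image_encoder = image_encoder    # e.g. a CLIP ViT backbone
        self.visual_sampler = visual_sampler  # spatial-aware visual sampler
        self.projector = projector            # maps visual features into LLM space
        self.llm = llm                        # decoder-only LLM (e.g. Vicuna)

    def forward(self, image, input_ids, referred_regions):
        # 1) Encode the image into a grid of patch features.
        patch_features = self.image_encoder(image)
        # 2) For each referred region (point/box/free-form shape), pool a
        #    continuous region feature from the patch grid.
        region_features = [
            self.visual_sampler(patch_features, region) for region in referred_regions
        ]
        # 3) Project image and region features into the LLM embedding space so
        #    they can be spliced into the token sequence at placeholder positions.
        visual_embeds = self.projector(patch_features)
        region_embeds = [self.projector(r) for r in region_features]
        # 4) The LLM consumes the mixed sequence and generates grounded text
        #    (nouns followed by box coordinates). "extra_embeddings" is a
        #    hypothetical interface for passing the spliced-in visual features.
        return self.llm(input_ids=input_ids,
                        extra_embeddings=[visual_embeds, *region_embeds])
```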

3.3 Spatial-aware Visual Sampler

Ferret’s spatial-aware visual sampler is designed to handle irregularly shaped regions and extracts regional continuous features. It involves sampling, gathering, and pooling steps, allowing the model to work with various region shapes.
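
Below is a simplified, single-block sketch of the sample-gather-pool idea. The actual sampler stacks several such blocks and uses learned fusion; density-aware sampling is also omitted here, so treat this only as an illustration of the three steps.

```python
import torch

def spatial_aware_sample(feature_map, region_mask, num_points=512, k=32):
    """Simplified sample -> gather -> pool over an arbitrarily shaped region.

    feature_map: (H, W, C) patch features from the image encoder.
    region_mask: (H, W) boolean mask of the referred region (any shape).
    """
    H, W, C = feature_map.shape
    ys, xs = torch.nonzero(region_mask, as_tuple=True)   # pixels inside the region
    coords = torch.stack([xs, ys], dim=-1).float()        # (N, 2)
    feats = feature_map[ys, xs]                           # (N, C)

    # 1) Sample: randomly pick up to num_points points from the region.
    idx = torch.randperm(coords.shape[0])[:num_points]
    coords, feats = coords[idx], feats[idx]

    # 2) Gather: for each sampled point, collect its k nearest neighbours.
    dists = torch.cdist(coords, coords)                   # (N, N)
    knn = dists.topk(k=min(k, coords.shape[0]), largest=False).indices
    neighbour_feats = feats[knn]                          # (N, k, C)

    # 3) Pool: fuse each neighbourhood, then pool over all points to obtain a
    #    single continuous region feature.
    point_feats = neighbour_feats.max(dim=1).values       # (N, C)
    return point_feats.max(dim=0).values                  # (C,)
```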

In the Ferret output, to achieve grounding, box coordinates are generated right after the corresponding regions/nouns in the text response. This helps the model understand what is groundable in the image and where these objects are located.
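
Assuming an output convention in which each groundable phrase is immediately followed by a "[x1, y1, x2, y2]" box (an illustrative format, not necessarily the exact one used by the model), a small parser for such responses might look like this:

```python
import re

# Assumed output convention (illustrative): a groundable phrase is immediately
# followed by its box as "[x1, y1, x2, y2]" in quantized coordinates.
GROUNDED_SPAN = re.compile(r"([\w\s]+?)\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")

def parse_grounded_response(text):
    """Extract (phrase, box) pairs from a grounded text response."""
    pairs = []
    for match in GROUNDED_SPAN.finditer(text):
        phrase = match.group(1).strip()
        box = tuple(int(match.group(i)) for i in range(2, 6))
        pairs.append((phrase, box))
    return pairs

example = "A dog [120, 310, 410, 620] is chasing a red ball [450, 500, 520, 570]."
print(parse_grounded_response(example))
# [('A dog', (120, 310, 410, 620)), ('is chasing a red ball', (450, 500, 520, 570))]
```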

Ferret utilizes Vicuna as its language model, a decoder-only LLM, which is instruction-tuned on top of LLaMA. The model processes both image and text data to generate responses.

4. GRIT: Ground and Refer Instruction-Tuning Dataset

In this section, we present GRIT, a Ground-and-Refer Instruction-Tuning dataset containing approximately 1.1M multimodal dialogues for model training. GRIT consists of three types of data:

  1. Public datasets converted into an instruction-following format.
  2. Instruction-tuning data generated through ChatGPT and GPT-4.
  3. Additional data for enhancing model robustness through spatial negative mining.

4.1 Hierarchy

Spatial understanding can be characterized based on granularity and task formats. During dataset creation, we consider four main granularity categories: individual objects, relationships among objects, descriptions of specific regions, and region-based complex reasoning. In terms of task format, we further divide the data into three types: Region-in Text-out data, Text-in Region-out data, and Text-Region combined data.

We compile extensive public data considering these dimensions and convert them into an instruction-following format using carefully designed templates.
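
As a toy illustration of this conversion step, the sketch below maps a raw annotation into an instruction-following question/answer pair; the templates and field names are hypothetical stand-ins for the carefully designed ones mentioned above.

```python
import random

# Hypothetical templates; the actual templates used for GRIT are not reproduced here.
REGION_IN_TEXT_OUT = [
    "What is the object in region {region}?",
    "Describe the object located at {region}.",
]
TEXT_IN_REGION_OUT = [
    "What is the location of {phrase} in the image?",
    "Find {phrase} and give its coordinates.",
]

def to_instruction_sample(record):
    """Convert one raw annotation into an instruction-following dialogue turn."""
    if record["format"] == "region_in_text_out":          # e.g. Visual Genome objects
        question = random.choice(REGION_IN_TEXT_OUT).format(region=record["box_text"])
        answer = record["label"]
    else:                                                  # e.g. RefCOCO / Object365
        question = random.choice(TEXT_IN_REGION_OUT).format(phrase=record["label"])
        answer = f"{record['label']} {record['box_text']}"
    return {"question": question, "answer": answer}

sample = to_instruction_sample(
    {"format": "text_in_region_out", "label": "a red umbrella",
     "box_text": "[220, 140, 380, 330]"}
)
print(sample["question"])
print(sample["answer"])
```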

4.2 Individual objects

To achieve visual understanding at the object level, we select object detection datasets like Visual Genome and Object365, as well as visual grounding datasets like RefCOCOs and Flickr30k-Entities. The data formats vary: Visual Genome object data follows a Region-in Text-out format, while visual grounding datasets and Object365 adhere to a Text-in Region-out format. This portion contributes a total of 678k samples.

4.3 Relationships among objects & descriptions of regions

We selected data pertaining to object relationships and region captions from Visual Genome to address these two facets. Both employ a Region-in Text-out format, resulting in 177k samples. As with the Visual Genome object data, we also extract segmentation masks of the objects in the Visual Genome relationship data via SAM.

4.4 Region-based complex reasoning

Regarding complex reasoning centered on specific regions, we constructed a novel dataset with the help of ChatGPT/GPT-4. It adopts a combined Text-Region format and is detailed in the subsequent section.

5. GPT-assisted Visual Instruction Data Generation

In addition to converting existing datasets, dialogue instruction tuning data is crucial for MLLM to understand human intention and generate responses. Few-shot prompting is used to obtain visual instruction tuning data, with a focus on region-based spatial knowledge. The process involves:

  1. Using symbolic scene descriptions, including object relationships, region captions, and coordinates.
  2. Human-annotated dialogues that emphasize specific regions and coordinates in input or output.
  3. Refinement of generated dialogues using ChatGPT/GPT-4 to ensure they follow patterns and rules.

Additionally, existing instruction-tuning data is processed to localize groundable nouns in the text and append bounding boxes, forming pseudo-grounded LLaVA instruction data for Ferret training.
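
A hypothetical sketch of how such a few-shot generation prompt could be assembled from a symbolic scene description; the actual system prompts and in-context examples used by the authors are not reproduced here.

```python
# Hypothetical few-shot example: a symbolic scene description (objects, boxes,
# relations) paired with a region-grounded dialogue.
FEW_SHOT_EXAMPLES = [
    {
        "scene": "objects: dog [120, 310, 410, 620]; ball [450, 500, 520, 570]\n"
                 "relations: dog -- chasing -- ball",
        "dialogue": "Human: What is the animal [120, 310, 410, 620] doing?\n"
                    "Assistant: The dog [120, 310, 410, 620] is chasing a ball "
                    "[450, 500, 520, 570].",
    },
]

def build_generation_prompt(scene_description):
    """Compose a few-shot prompt asking for a region-grounded dialogue."""
    parts = ["Write a dialogue about the image. Refer to regions with "
             "[x1, y1, x2, y2] boxes."]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Scene:\n{ex['scene']}\nDialogue:\n{ex['dialogue']}")
    parts.append(f"Scene:\n{scene_description}\nDialogue:")
    return "\n\n".join(parts)

print(build_generation_prompt(
    "objects: bicycle [40, 200, 300, 560]; helmet [90, 150, 180, 230]"
))
```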

5.1 Spatial Negative Mining

As highlighted in prior studies (Li et al., 2023e; Liu et al., 2023a), MLLMs exhibit a propensity to hallucinate in response to yes/no questions. We observed a similar phenomenon when asking about detailed regions. To address this, we also conduct negative sample mining in two ways: (i) Image-conditioned Category Localization, and (ii) Semantics-conditioned Category Localization. Both ask the model to localize specific object categories, strengthening its ability to discern and recognize the absence of certain objects. They differ in how the negative category is selected. For (i), Object365 data are employed, and we randomly select an object class from the vocabulary that does not appear in the given image. For (ii), Flickr30k data is used, and negative categories are sourced by using ChatGPT/GPT-4 to find entities most analogous to the original class, attribute, or quantity, e.g., 'man' vs. 'woman', 'blue' vs. 'yellow', 'two' vs. 'three'.

We curate the data to maintain an equilibrium between positive and negative samples for each of the two types. 95k data are collected. A more comprehensive elaboration is provided in Appendix A.2.
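
A minimal sketch of the two negative-mining strategies described above; the vocabulary, categories, and question phrasings are illustrative assumptions.

```python
import random

def image_conditioned_negative(vocab, classes_in_image):
    """(i) Pick a category from the detection vocabulary that is absent from the
    image (Object365-style) and ask the model to localize it."""
    absent = [c for c in vocab if c not in classes_in_image]
    negative = random.choice(absent)
    return f"Is there a {negative} in the image? If so, where is it?"

def semantics_conditioned_negative(entity, analogous_entities):
    """(ii) Swap an entity for an analogous one (e.g. 'man' -> 'woman',
    'blue' -> 'yellow'); in the paper the analogous entities are proposed by
    ChatGPT/GPT-4, here they are assumed to be given."""
    negative = random.choice(analogous_entities)
    return f"What is the location of the {negative} in the image?"

print(image_conditioned_negative(
    vocab=["dog", "zebra", "laptop", "umbrella"],
    classes_in_image={"dog", "umbrella"},
))
print(semantics_conditioned_negative("man", ["woman", "boy"]))
```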

6. Experiments

First, we describe the training details of Ferret. In evaluation, we start with conventional referring and grounding benchmarks, then demonstrate the power of Ferret in more complex multimodal chatting that requires refer-and-ground capability. For detailed visualizations, see Appendix C of the paper. We further ablate key components of Ferret, analyze its object hallucination, and discuss Ferret vs. GPT-4V.

6.1 Training Details

We initialize the image encoder with CLIP-ViT-L/14@336p, the LLM with Vicuna, and the projection layer with LLaVA's first-stage weights, leaving the visual sampler randomly initialized. After initialization, Ferret is trained on the aforementioned GRIT data for three epochs, optimized with AdamW (Loshchilov & Hutter, 2017) at a learning rate of 2e-5 and a batch size of 128. Training takes roughly 5/2.5 days on 8 A100 GPUs for Ferret-13B/7B. During training, when the input refers to regions, we randomly choose either the center points or the bounding boxes.
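
For reference, the stated hyperparameters can be collected into a simple config sketch; anything not mentioned in the text (e.g., the learning-rate schedule or warmup) is deliberately omitted.

```python
# Training setup as summarized above; unstated details are intentionally left out.
ferret_training_config = {
    "image_encoder_init": "CLIP-ViT-L/14@336p",
    "llm_init": "Vicuna (7B or 13B)",
    "projector_init": "LLaVA stage-1 weights",
    "visual_sampler_init": "random",
    "optimizer": "AdamW",            # Loshchilov & Hutter (2017)
    "learning_rate": 2e-5,
    "batch_size": 128,
    "epochs": 3,
    "hardware": "8x A100",
    "region_referring": ["center point", "bounding box"],  # chosen at random per sample
}
```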


Input Referring

The model’s capability of understanding referring is reflected in that, given a referred region in the question, how accurately the model can understand the semantics of the referred region. To measure it, we start with the most basic semantics, object, as it is fundamental and clear to define. To be more specific, the task we evaluate on is Referring Object Classification: the question refers to a specific region in the image, and the model needs to classify the object in the region. Since Ferret and MLLMs usually generate free-form text responses, it is inaccurate to match the predicted class with the ground-truth class if directly asking the model to classify without constraints. Alternatively, we make it a binary-choice question in the format of “Is the object ⟨location⟩ a ⟨class A⟩ or a ⟨class B⟩?”. We feed the binary-choice question and image into the MLLMs to obtain the response, and then detect if the response matches the ground-truth (GT) class by some rule.

To prepare the data, we used the validation split of LVIS dataset (Gupta et al., 2019) covering over 1000 object categories, and sampled 2667 objects as the GT objects. Then, we randomly choose a different object category in the same image whose central point is close to the GT object as the negative object, and replace ⟨class A⟩ and ⟨class B⟩ with those two randomly to form 2667 questions. Additionally, to mimic the versatility of referring in human life, we replace the ⟨location⟩ with three different types: point, box, and free-form shape. For point, we randomly sample a point inside the GT object that is also near the GT object’s boundary. For box, we use the GT bounding box provided by LVIS. For the free-form shape, we randomly generate some strokes inside the GT object to simulate that. Results on all three types of referring are summarized in Table 3. Ferret can significantly outperform previous models (Peng et al., 2023; Chen et al., 2023b) and handle all types of referring, a capability notably absent in previous works. We visualize some examples in Figure 5.
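
A toy sketch of the binary-choice evaluation protocol described above; the exact rule used to match responses against the GT class is not specified here, so the matcher below is only an illustrative stand-in.

```python
def make_binary_choice_question(location_text, class_a, class_b):
    """Build the binary-choice referring question used for evaluation."""
    return f"Is the object {location_text} a {class_a} or a {class_b}?"

def matches_ground_truth(response, gt_class, negative_class):
    """Toy rule-based check: correct if the response mentions the GT class and,
    when both classes appear, mentions the GT class first."""
    r = response.lower()
    gt, neg = gt_class.lower(), negative_class.lower()
    return gt in r and (neg not in r or r.find(gt) < r.find(neg))

q = make_binary_choice_question("[312, 108, 420, 255]", "lynx", "house cat")
print(q)
print(matches_ground_truth("The object in that region is a lynx.", "lynx", "house cat"))
```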


Output Grounding

Ferret performs well in referential dialogue, allowing for its integration into various VL tasks, notably those with grounding outputs. To rigorously assess the grounding capability, we first subject Ferret to benchmark visual grounding tasks in a generative paradigm. Then, to measure the alignments between words and regions, we further evaluate Ferret on grounded captioning task.


Visual Grounding

Visual grounding aims to ground language queries into aligned image regions. We experiment on the sub-tasks of referring expression comprehension (REC), with three renowned benchmarks: RefCOCO (Lin et al., 2014), RefCOCO+ (Yu et al., 2016), and RefCOCOg (Mao et al., 2016), and phrase grounding, with the Flickr30k Entities dataset (Plummer et al., 2015). The REC task involves a question or description about a specific area in an image, with the model expected to predict just one bounding box. Phrase grounding, conversely, seeks to associate all the noun phrases in the input sentence with corresponding boxes, requiring the model to predict these boxes and the word-box connections. For both tasks, we use uniform prompts of the form "What are the locations of <query>?", where <query> is the textual referring expression for REC, or a comma-delimited aggregation of the given phrases for phrase grounding. The model is trained to output in the "<phrase> [box]" format. A generated bounding box is considered correct if its intersection over union (IoU) with the GT box is greater than 0.5. As shown in Table 5, Ferret achieves outstanding performance on all metrics and is comparable to specialized fine-tuning approaches (Kamath et al., 2021). Some results are visualized in Figure 5.
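
The IoU > 0.5 criterion can be made concrete with a small helper (a generic sketch, not the benchmark's official evaluation code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter) if inter else 0.0

def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """A predicted box counts as correct when IoU with the GT box exceeds 0.5."""
    correct = sum(iou(p, g) > threshold for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

print(iou((10, 10, 110, 110), (60, 60, 160, 160)))  # ≈ 0.143
```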


Grounded Captioning

The grounded captioning task requires the model to generate a caption and ground all generated noun phrases to image regions. The final predictions generally consist of three parts, i.e., the text caption, visual regions as boxes, and the grounding alignments between words and boxes. Following the established benchmarks on the Flickr30k Entities dataset, we evaluate captioning and grounding separately with the captioning metrics and grounding F1 scores, respectively. F1all evaluates grounding as a multi-label classification problem. We also report F1loc that only computes the grounding score on correctly predicted object words. Results are summarized in Table 4, and Ferret achieves state-of-the-art.
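
A simplified sketch of an F1-style grounding score over (phrase, box) pairs, reusing an IoU helper such as the one in the previous sketch; the official Flickr30k Entities protocol (and the exact F1all vs. F1loc computation) is more involved than this.

```python
def grounding_f1(pred_pairs, gt_pairs, iou_fn, thr=0.5):
    """Toy F1 over (phrase, box) pairs.

    A predicted pair is a true positive when its phrase matches a GT phrase and
    the boxes overlap with IoU > thr. F1all would use all predicted pairs;
    F1loc would restrict the computation to correctly predicted object words.
    """
    gt_left = list(gt_pairs)
    tp = 0
    for phrase, box in pred_pairs:
        for gt_phrase, gt_box in gt_left:
            if phrase.lower() == gt_phrase.lower() and iou_fn(box, gt_box) > thr:
                tp += 1
                gt_left.remove((gt_phrase, gt_box))
                break
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gt_pairs) if gt_pairs else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# Example usage with the iou helper from the previous sketch:
# score = grounding_f1(pred_pairs, gt_pairs, iou_fn=iou)
```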

Note: The referenced tables and figures (e.g., Table 3 and Figure 5) appear in the original paper and are not reproduced here.

6.2 Ferret-Bench: Multimodal Chatting with Referring and Grounding

Multimodal chatting has been an emergent ability of MLLMs. Previous benchmarks (Liu et al., 2023b) mainly evaluate conversation, detailed description, and complex reasoning via GPT-4 as a judge. Yet a gap remains: no dataset currently evaluates multimodal chatting that necessitates referring or grounding actions, e.g., instances where individuals reference an unfamiliar object and inquire about its purpose. To benchmark this intriguing and practical capability, we introduce Ferret-Bench, which covers three kinds of region-based questions evaluating referring and grounding capabilities: Referring Description, Referring Reasoning, and Grounding in Conversation.

7. Ablation

In the ablation studies below, by default, we ablate Ferret-7B and mainly evaluate referring object classification and grounding tasks on the Flickr30k Entities validation set.

Mutual benefits of grounding and referring. As shown in Table 8, grounding and referring, as two main capabilities emphasized in this paper, can actually benefit each other. Particularly, when adding grounding data into training, the referring performance gets improved, and vice versa.

Spatial-aware Visual Sampler. We ablate the effectiveness of the spatial-aware visual sampler by replacing it with the visual sampler in SEEM (Zou et al., 2023), where they average the features of all the sampled points as the region feature. As we can see in Table 9, ours can outperform the previous visual sampler in all three referring tasks.

LLM model size. We study how much the LLM model size influences referring and grounding performance. As seen in Tables 3-7, a larger LLM backbone generally helps.

7.1 Object Hallucination

Owing to the incorporation of fine-grained spatial knowledge and negative mining, Ferret also exhibits strong robustness against the hallucination problem. We evaluate object hallucination on the POPE benchmark (Li et al., 2023e). Results are summarized in Table 10. Ferret exhibits performance comparable to Shikra (Chen et al., 2023b) and far surpasses other recent popular MLLMs.

7.2 Ferret vs. GPT-4 Vision: A Quick Glance at Referring & Grounding

Recently, GPT-4 released its multimodal version to the public, named GPT-4V. In a follow-up technical report (Yang et al., 2023), GPT-4V's grounding ability is briefly touched upon. In this section, we use some examples to probe GPT-4V's referring and grounding capabilities and compare them with Ferret's. For referring, GPT-4V is prompted in two ways: (i) the referred region is marked by a red circle/outline in the image, and the question asks about the region in the red circle/outline; (ii) the image is left unchanged, and instead the image size and region coordinates are provided in the question to refer to the region.

The result on LLaVA-Bench is obtained by evaluating the released LLaVA checkpoint. The slight discrepancy might be due to evolving GPT-4 APIs. For Ferret-Bench, we employ the same conversation template as Ferret, providing LLaVA with a predefined input size, resizing all coordinates accordingly, and generating a response.

Unlike other methods, Ferret refrains from relying on VQA. This decision stems from our observation that VQA answers tend to be concise, and this brevity can restrict the conversational capabilities of LLMs.

As we observed, GPT-4V is able to understand referring to a certain extent via either a colored region in the image or coordinates in text. However, compared with Ferret, GPT-4V falls short in precise understanding when referring to small regions, e.g., the 'shock absorber' in the motorcycle (see the upper example in Figure 6). On the other hand, GPT-4V is more knowledgeable in common sense, e.g., it can further highlight that the exhaust pipe can reduce noise, a nuance potentially attributable to GPT-4's enhanced linguistic capabilities. In regard to grounding, we tested GPT-4V with CAPTCHAs, a task also mentioned in Yang et al. (2023). In the traffic-light example, Ferret excels at accurately identifying most traffic lights even in cluttered scenes, as demonstrated in the bottom example of Figure 6.

8. Conclusion

We present Ferret, a new multimodal large language model adept at referring and grounding. Ferret can refer to image regions of any free-form shape and automatically establishes grounding for text it deems groundable. We curated the GRIT dataset for model training and the Ferret-Bench dataset for evaluation. Like most MLLMs, Ferret may produce harmful or counterfactual responses. For future work, inspired by LISA (Lai et al., 2023), we plan to enhance Ferret to output segmentation masks in addition to bounding boxes.
