Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
Agent that can operate your mobile phones, robots, etc.: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices like mobile phones and robots for automatic operation based on the visual environment and text instructions.
Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
We open-source Qwen2-VL-2B and Qwen2-VL-7B under the Apache 2.0 license, and we release the API of Qwen2-VL-72B! The open-source models are integrated into Hugging Face Transformers, vLLM, and other third-party frameworks. We hope you enjoy them!
Performance
We evaluate our model's visual capabilities across seven key dimensions: complex college-level problem-solving, mathematical abilities, document and table comprehension, multilingual text-image understanding, general scenario question-answering, video comprehension, and agent-based interactions. Overall, our 72B model showcases top-tier performance across most metrics, often surpassing even closed-source models like GPT-4o and Claude 3.5 Sonnet. Notably, it demonstrates a significant edge in document understanding.
At the 7B scale, we’ve managed to retain support for image, multi-image, and video inputs, delivering competitive performance in a more cost-effective model size. Specifically, our model excels in document understanding tasks such as DocVQA and in multilingual text understanding from images, as assessed by MTVQA, establishing state-of-the-art performance.
Additionally, we’re excited to introduce a smaller 2B model, optimized for potential mobile deployment. Despite its compact size, this model boasts strong performance in image, video, and multilingual comprehension. It particularly shines in video-related tasks, document understanding, and general scenario question-answering when compared to other models of similar scale.
Model Capabilities
Example: Multi-object Recognition

User: [image] Output the color and number of each box.

Qwen2-VL: The image shows a stack of colorful blocks with numbers on them. Here is the color and number of each block from top to bottom:

- Top row (single block): Blue with the number 9
- Second row (two blocks): Light blue with the number 7, and green with the number 8
- Third row (three blocks): Purple with the number 4, pink with the number 5, and light green with the number 6
- Bottom row (four blocks): Red with the number 0, orange with the number 1, yellow with the number 2, and light green with the number 3
Example: Code

User: [image] Solve the algorithmic problem in the web page.

Qwen2-VL: To solve the "Snake in Matrix" problem, we need to implement a function that takes the size of the matrix n and a list of commands, and returns the final position of the snake after executing all the commands. Here's a step-by-step approach to solve this problem:

1. Initialize Variables:
   - Initialize the current position of the snake as 0.
   - Initialize the direction of the snake as "RIGHT" (initially, the snake moves to the right).

2. Define Movement Functions:
   - Create a function to move the snake based on the current direction and the size of the matrix n.
   - Update the position of the snake accordingly.

3. Process Commands:
   - Iterate through each command in the list of commands.
   - For each command, update the direction and position of the snake.

4. Return Final Position:
   - After processing all commands, return the final position of the snake.

Here's the Python code to implement the solution:
```python
from typing import List


class Solution:
    def finalPositionOfSnake(self, n: int, commands: List[str]) -> int:
        # Initialize variables
        position = 0
        direction = "RIGHT"

        # Define movement functions
        def move_right():
            nonlocal position
            position += 1

        def move_down():
            nonlocal position
            position += n

        def move_left():
            nonlocal position
            position -= 1

        def move_up():
            nonlocal position
            position -= n

        # Process commands
        for command in commands:
            if command == "RIGHT":
                move_right()
            elif command == "DOWN":
                move_down()
            elif command == "LEFT":
                move_left()
            elif command == "UP":
                move_up()

        # Return final position
        return position
```

This code defines a class Solution with a method finalPositionOfSnake that takes the size of the matrix n and a list of commands, and returns the final position of the snake after executing all the commands. The movement functions (move_right, move_down, move_left, move_up) are used to update the position of the snake based on the current direction.
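As a quick sanity check of the code above (the grid size and command sequence below are illustrative, not part of the model's original response): on a 2x2 grid with cells numbered 0 to 3, the commands ["RIGHT", "DOWN"] should move the snake from cell 0 to cell 3.

```python
sol = Solution()
print(sol.finalPositionOfSnake(2, ["RIGHT", "DOWN"]))  # prints 3
```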
Example: Video Understanding

User: Please describe the video in detail.

Qwen2-VL: The video begins with a man speaking to the camera, followed by a group of people sitting in a control room. The camera then cuts to two men floating inside a space station, where they are seen speaking to the camera. The men appear to be astronauts, and they are wearing space suits. The space station is filled with various equipment and machinery, and the camera pans around to show the different areas of the station. The men continue to speak to the camera, and they appear to be discussing their mission and the various tasks they are performing. Overall, the video provides a fascinating glimpse into the world of space exploration and the daily lives of astronauts.

User: What is the color of the clothes the two astronauts are wearing?

Qwen2-VL: The two astronauts are wearing blue and black clothes.

Example: Live chat
The model facilitates Function Calling, enabling it to harness external tools for real-time data retrieval – be it flight statuses, weather forecasts, or package tracking – by deciphering visual cues. This integration of visual interpretation with functional execution elevates its utility, making it a powerful tool for information management and decision-making (a schematic sketch of this loop is given at the end of this subsection).

Example: Function Calling (Ask about the weather)

Visual Interactions represent a significant stride towards mimicking human perception. By allowing the model to engage with visual stimuli akin to human senses, we're pushing the boundaries of AI's ability to perceive and respond to its environment. This capability paves the way for more intuitive and immersive interactions, where Qwen2-VL acts not just as an observer, but as an active participant in our visual experiences.

Example: UI Interactions (Operate a Mobile Phone)

Of course, the model is not perfect and has some limitations that we hope you can understand. For example, the model is unable to extract audio from videos, and its knowledge is only up to date as of June 2023. Additionally, the model cannot guarantee complete accuracy when processing complex instructions or scenarios, and it is relatively weak in tasks involving counting, character recognition, and 3D spatial awareness.
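To make the Function Calling pattern concrete, here is a minimal, tool-agnostic sketch of the loop. The tool name get_weather, its schema, and the model output shown are hypothetical placeholders for illustration, not Qwen2-VL's actual API.

```python
import json

# Hypothetical tool schema advertised to the model together with an image and the
# question "What's the weather like in the city shown in this photo?".
tools = [
    {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]


def get_weather(city: str) -> str:
    # Placeholder implementation; a real tool would query a weather service.
    return f"Sunny, 23°C in {city}"


# Suppose the model inspects the image, recognizes the city, and emits a tool call
# like the following (illustrative output, not a recorded model response):
model_tool_call = '{"name": "get_weather", "arguments": {"city": "Beijing"}}'

call = json.loads(model_tool_call)
if call["name"] == "get_weather":
    result = get_weather(**call["arguments"])
    # The tool result would then be passed back to the model to compose the final answer.
    print(result)
```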
Model Architecture

Overall, we've continued with the Qwen-VL architecture, which leverages a Vision Transformer (ViT) model and Qwen2 language models. For all these variants, we utilized a ViT with approximately 600M parameters, designed to handle both image and video inputs seamlessly. To further enhance the model's ability to effectively perceive and comprehend visual information in videos, we introduced several key upgrades:
A key architectural improvement in Qwen2-VL is the implementation of Naive Dynamic Resolution support. Unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, thereby ensuring consistency between the model input and the inherent information in images. This approach more closely mimics human visual perception, allowing the model to process images of any clarity or size.
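In practice, the number of visual tokens an image is mapped to can be bounded when loading the processor. A minimal sketch, following the usage documented on the open-source model cards (the specific pixel bounds below are illustrative):

```python
from transformers import AutoProcessor

# Each visual token corresponds to a 28x28-pixel patch after resizing, so bounding
# the pixel budget bounds the number of visual tokens produced per image.
min_pixels = 256 * 28 * 28    # at least ~256 visual tokens per image
max_pixels = 1280 * 28 * 28   # at most ~1280 visual tokens per image

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```

Tighter bounds trade visual detail for speed and memory; without them, the processor chooses a resolution-dependent token count.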
Another key architectural enhancement is the innovation of Multimodal Rotary Position Embedding (M-ROPE). By deconstructing the original rotary embedding into three parts representing temporal and spatial (height and width) information, M-ROPE enables the LLM to concurrently capture and integrate 1D textual, 2D visual, and 3D video positional information.
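To build intuition, here is a toy sketch of the position-ID scheme, not the actual implementation in Transformers: every token carries a (temporal, height, width) triple; text tokens advance all three components together (so M-ROPE reduces to ordinary 1D RoPE for pure text), while visual tokens are indexed by frame, row, and column.

```python
# Toy illustration of multimodal position IDs: one (temporal, height, width)
# triple per token. This is a conceptual sketch only; the real indexing logic
# lives inside the Qwen2-VL modeling code and differs in details.
def toy_mrope_position_ids(segments):
    """segments: ("text", n_tokens), ("image", rows, cols) or ("video", frames, rows, cols)."""
    position_ids = []
    offset = 0  # next free scalar position
    for seg in segments:
        if seg[0] == "text":
            _, n = seg
            position_ids += [(offset + i,) * 3 for i in range(n)]
            offset += n
        elif seg[0] == "image":
            _, rows, cols = seg
            position_ids += [(offset, offset + h, offset + w)
                             for h in range(rows) for w in range(cols)]
            offset += max(rows, cols)
        elif seg[0] == "video":
            _, frames, rows, cols = seg
            position_ids += [(offset + t, offset + h, offset + w)
                             for t in range(frames) for h in range(rows) for w in range(cols)]
            offset += max(frames, rows, cols)
    return position_ids

# Text, then a 2x2 grid of image tokens, then more text.
print(toy_mrope_position_ids([("text", 3), ("image", 2, 2), ("text", 2)]))
```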
Developing with Qwen2-VL

To use the largest Qwen2-VL model, Qwen2-VL-72B, you can currently access it through our official API (sign up for an account and obtain an API key through DashScope), as demonstrated below:
```python
from openai import OpenAI
import os
import base64


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


image_path = "dog_and_girl.jpeg"
base64_image = encode_image(image_path)


def get_response():
    client = OpenAI(
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    completion = client.chat.completions.create(
        model="qwen-vl-max-0809",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is this?"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
                        },
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    },
                ],
            }
        ],
        top_p=0.8,
        stream=True,
        stream_options={"include_usage": True},
    )
    for chunk in completion:
        print(chunk.model_dump_json())


if __name__ == "__main__":
    get_response()
```

The 2B and 7B models of the Qwen2-VL series are open-sourced and accessible on Hugging Face and ModelScope. You can explore the model cards for detailed usage instructions, features, and performance metrics. Below we provide an example of the simplest usage with HF Transformers.
Make sure you install transformers from source via pip install git+https://github.com/huggingface/transformers, as the code for Qwen2-VL was only recently merged into the main branch. If you don't install it from source, you may encounter the following error:
```
KeyError: 'qwen2_vl'
```

We offer a toolkit to help you handle various types of visual input more conveniently. It supports inputs including base64, URLs, and interleaved images and videos. You can install it using the following command:
```
pip install qwen-vl-utils
```

Here is a code snippet for demonstration. Specifically, we recommend using flash attention 2 if possible for the sake of acceleration and memory saving.
```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", device_map="auto"
)
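# To enable flash attention 2 as recommended above, the model can instead be
# loaded as sketched below (assumes the flash-attn package is installed and
# requires `import torch` for the dtype):
#
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )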

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
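# For video inputs, qwen-vl-utils accepts an analogous message format; a sketch
# (the local file path and fps value are placeholders, not from the original post):
#
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
#             {"type": "text", "text": "Describe this video."},
#         ],
#     }
# ]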

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

To facilitate seamless integration and use of our latest models, we support a range of tools and frameworks in the open-source ecosystem, including quantization (AutoGPTQ, AutoAWQ), deployment (vLLM), finetuning (Llama-Factory), etc.