Not published yet.
Contents
(Replaced with the performance tables below.)
The datasets used for evaluation are listed in the tables that follow:
Qwen Models
Datasets | Qwen1.5-0.5B-Chat | Qwen2-0.5B-Instruct | Qwen1.5-1.8B-Chat | Qwen2-1.5B-Instruct | Qwen1.5-7B-Chat | Qwen2-7B-Instruct | Llama-3-70B-Instruct | Qwen1.5-72B-Chat | Qwen2-72B-Instruct | Qwen2-57B-A14B-Instruct | Mixtral-8x7B-Instruct-v0.1 | Yi-1.5-34B-Chat | Qwen1.5-32B-Chat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
# Non-Emb Params | 0.5B | 0.5B | 1.3B | 1.3B | 7.7B | 7.6B | - | 72B | 72B | 14B | 12B | 34B | 32B |
English | | | | | | | | | | | | | |
MMLU | 35.0 (70.0) | 37.9 (75.8) | 43.7 (33.6) | 52.4 (40.3) | 59.5 (7.7) | 70.5 (9.3) | 82.0 | 75.6 (1.05) | 82.3 (1.14) | 75.4 (5.39) | 71.4 (5.95) | 76.8 (2.26) | 74.8 (2.34) |
MMLU-Pro | - | - | - | - | 29.1 (3.78) | 44.1 (5.80) | 56.2 | 51.7 (0.72) | 64.4 (0.89) | 52.8 (3.77) | 43.3 (3.61) | 52.3 (1.54) | 46.4 (1.45) |
GPQA | - | - | - | - | 27.8 (3.61) | 25.3 (3.33) | 41.9 | 39.4 (0.55) | 42.4 (0.59) | 34.3 (2.45) | - | - | 30.8 (0.96) |
Theorem QA | - | - | - | - | 14.1 (1.83) | 25.3 (3.33) | 42.5 | 28.8 (0.40) | 44.4 (0.62) | 33.1 (2.37) | - | - | 30.9 (0.96) |
MT-Bench | - | - | - | - | 7.60 (0.99) | 8.41 (1.11) | 8.95 | 8.61 (0.12) | 9.12 (0.13) | 8.55 (0.61) | 8.30 (0.69) | 8.50 (0.25) | 8.30 (0.26) |
Arena-Hard | - | - | - | - | - | - | - | 36.1 (0.50) | 48.1 (0.67) | - | - | - | - |
Coding | | | | | | | | | | | | | |
HumanEval | 9.1 (18.2) | 17.1 (34.2) | 25.0 (19.2) | 37.8 (29.1) | 46.3 (6.01) | 79.9 (10.5) | 81.7 | 71.3 (0.99) | 86.0 (1.19) | 79.9 (5.71) | 45.1 (3.76) | 75.2 (2.21) | 68.3 (2.13) |
MBPP | - | - | - | - | 48.9 (6.35) | 67.2 (8.84) | 82.3 | 71.9 (1.00) | 80.2 (1.11) | 70.9 (5.07) | 59.5 (4.96) | 74.6 (2.19) | 67.9 (2.12) |
MultiPL-E | - | - | - | - | 27.2 (3.53) | 59.1 (7.78) | 63.4 | 48.1 (0.67) | 69.2 (0.96) | 66.4 (4.74) | - | - | 50.7 (1.58) |
EvalPlus | - | - | - | - | 44.8 (5.82) | 70.3 (9.24) | 75.2 | 66.9 (0.93) | 79.0 (1.10) | 71.6 (5.11) | 48.5 (4.04) | - | 63.6 (1.99) |
LiveCodeBench | - | - | - | - | 6.0 (0.78) | 26.6 (3.50) | 29.3 | 17.9 (0.25) | 35.7 (0.50) | 25.5 (1.82) | 12.3 (1.03) | - | 15.2 (0.47) |
Mathematics | | | | | | | | | | | | | |
GSM8K | 11.3 (22.6) | 40.1 (80.2) | 35.3 (27.2) | 61.6 (47.4) | 60.3 (7.8) | 82.3 (10.8) | 93.0 | 82.7 (1.15) | 91.1 (1.27) | 79.6 (5.69) | 65.7 (5.47) | 90.2 (2.65) | 83.6 (2.61) |
MATH | - | - | - | - | 23.2 (3.01) | 49.6 (6.53) | 50.4 | 42.5 (0.59) | 59.7 (0.83) | 49.1 (3.51) | 30.7 (2.56) | 50.1 (1.47) | 42.4 (1.33) |
Chinese | | | | | | | | | | | | | |
C-Eval | 37.2 (74.4) | 45.2 (90.4) | 55.3 (42.5) | 63.8 (49.1) | 67.3 (8.7) | 77.2 (10.2) | 61.6 | 76.1 (1.05) | 83.8 (1.16) | 80.5 (5.75) | - | - | 76.7 (2.40) |
AlignBench | - | - | - | - | 6.20 (0.80) | 7.21 (0.95) | 7.42 | 7.28 (0.10) | 8.27 (0.11) | 7.36 (0.53) | 5.70 (0.47) | 7.20 (0.21) | 7.19 (0.22) |
Multilingual | | | | | | | | | | | | | |
IFEval (Prompt Strict-Acc.) | 14.6 (29.2) | 20.0 (40.0) | 16.8 (12.9) | 29.0 (22.3) | - | - | 77.3 | 55.8 (0.78) | 77.6 (1.08) | - | - | - | - |
(Blog footnote) The numbers in parentheses are each benchmark score divided by the parameter count listed in the # Non-Emb Params row (in billions).
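As a quick sanity check, the parenthesized values can be reproduced with a single division. The minimal Python sketch below reuses two MMLU entries from the table above; only those copied numbers are involved, nothing else is assumed.

```python
# Minimal sketch: reproduce the parenthesized score-per-parameter numbers
# by dividing a benchmark score by the value in the "# Non-Emb Params" row
# (in billions). Example values are the MMLU entries from the table above.
scores = {"Qwen2-7B-Instruct": 70.5, "Qwen2-72B-Instruct": 82.3}
params_b = {"Qwen2-7B-Instruct": 7.6, "Qwen2-72B-Instruct": 72.0}

for name, score in scores.items():
    print(f"{name}: {score / params_b[name]:.2f}")  # -> 9.28 and 1.14
```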
Qwen2-7B-Instruct
Datasets | Llama-3-8B-Instruct | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen1.5-7B-Chat | Qwen2-7B-Instruct |
---|---|---|---|---|---|
English | |||||
MMLU | 68.4 | 69.5 | 72.4 | 59.5 | 70.5 |
MMLU-Pro | 41.0 | - | - | 29.1 | 44.1 |
GPQA | 34.2 | - | - | 27.8 | 25.3 |
Theorem QA | 23.0 | - | - | 14.1 | 25.3 |
MT-Bench | 8.05 | 8.20 | 8.35 | 7.60 | 8.41 |
Coding | |||||
HumanEval | 62.2 | 66.5 | 71.8 | 46.3 | 79.9 |
MBPP | 67.9 | - | - | 48.9 | 67.2 |
MultiPL-E | 48.5 | - | - | 27.2 | 59.1 |
Evalplus | 60.9 | - | - | 44.8 | 70.3 |
LiveCodeBench | 17.3 | - | - | 6.0 | 26.6 |
Mathematics | |||||
GSM8K | 79.6 | 84.8 | 79.6 | 60.3 | 82.3 |
MATH | 30.0 | 47.7 | 50.6 | 23.2 | 49.6 |
Chinese | |||||
C-Eval | 45.9 | - | 75.6 | 67.3 | 77.2 |
AlignBench | 6.20 | 6.90 | 7.01 | 6.20 | 7.21 |
Qwen2-0.5B-Instruct & Qwen2-1.5B-Instruct
Datasets | Qwen1.5-0.5B-Chat | Qwen2-0.5B-Instruct | Qwen1.5-1.8B-Chat | Qwen2-1.5B-Instruct |
---|---|---|---|---|
MMLU | 35.0 | 37.9 | 43.7 | 52.4 | |
HumanEval | 9.1 | 17.1 | 25.0 | 37.8 | |
GSM8K | 11.3 | 40.1 | 35.3 | 61.6 | |
C-Eval | 37.2 | 45.2 | 55.3 | 63.8 | |
IFEval (Prompt Strict-Acc.) | 14.6 | 20.0 | 16.8 | 29.0 |
Qwen2-7B
Datasets | Mistral-7B | Gemma-7B | Llama-3-8B | Qwen1.5-7B | Qwen2-7B |
---|---|---|---|---|---|
# Params | 7.2B | 8.5B | 8.0B | 7.7B | 7.6B |
# Non-emb Params | 7.0B | 7.8B | 7.0B | 6.5B | 6.5B |
English | |||||
MMLU | 64.2 | 64.6 | 66.6 | 61.0 | 70.3 |
MMLU-Pro | 30.9 | 33.7 | 35.4 | 29.9 | 40.0 |
GPQA | 24.7 | 25.7 | 25.8 | 26.7 | 31.8 |
Theorem QA | 19.2 | 21.5 | 22.1 | 14.2 | 31.1 |
BBH | 56.1 | 55.1 | 57.7 | 40.2 | 62.6 |
HellaSwag | 83.2 | 82.2 | 82.1 | 78.5 | 80.7 |
Winogrande | 78.4 | 79.0 | 77.4 | 71.3 | 77.0 |
ARC-C | 60.0 | 61.1 | 59.3 | 54.2 | 60.6 |
TruthfulQA | 42.2 | 44.8 | 44.0 | 51.1 | 54.2 |
Coding | |||||
HumanEval | 29.3 | 37.2 | 33.5 | 36.0 | 51.2 |
MBPP | 51.1 | 50.6 | 53.9 | 51.6 | 65.9 |
EvalPlus | 36.4 | 39.6 | 40.3 | 40.0 | 54.2 |
MultiPL-E | 29.4 | 29.7 | 22.6 | 28.1 | 46.3 |
Mathematics | |||||
GSM8K | 52.2 | 46.4 | 56.0 | 62.5 | 79.9 |
MATH | 13.1 | 24.3 | 20.5 | 20.3 | 44.2 |
Chinese | |||||
C-Eval | 47.4 | 43.6 | 49.5 | 74.1 | 83.2 |
CMMLU | - | - | 50.8 | 73.1 | 83.9 |
Multilingual | |||||
Multi-Exam | 47.1 | 42.7 | 52.3 | 47.7 | 59.2 |
Multi-Understanding | 63.3 | 58.3 | 68.6 | 67.6 | 72.0 |
Multi-Mathematics | 26.3 | 39.1 | 36.3 | 37.3 | 57.5 |
Multi-Translation | 23.3 | 31.2 | 31.9 | 28.4 | 31.5 |
Qwen2-0.5B & Qwen2-1.5B
Datasets | Phi-2 | Gemma-2B | MiniCPM | Qwen1.5-1.8B | Qwen2-0.5B | Qwen2-1.5B |
---|---|---|---|---|---|---|
# Non-Emb Params | 2.5B | 2.0B | 2.4B | 1.3B | 0.35B | 1.3B |
MMLU | 52.7 | 42.3 | 53.5 | 46.8 | 45.4 | 56.5 |
MMLU-Pro | - | 15.9 | - | - | 14.7 | 21.8 |
Theorem QA | - | - | - | - | 8.9 | 15.0 |
HumanEval | 47.6 | 22.0 | 50.0 | 20.1 | 22.0 | 31.1 |
MBPP | 55.0 | 29.2 | 47.3 | 18.0 | 22.0 | 37.4 |
GSM8K | 57.2 | 17.7 | 53.8 | 38.4 | 36.5 | 58.5 |
MATH | 3.5 | 11.8 | 10.2 | 10.1 | 10.7 | 21.7 |
BBH | 43.4 | 35.2 | 36.9 | 24.2 | 28.4 | 37.2 |
HellaSwag | 73.1 | 71.4 | 68.3 | 61.4 | 49.3 | 66.6 |
Winogrande | 74.4 | 66.8 | - | 60.3 | 56.8 | 66.2 |
ARC-C | 61.1 | 48.5 | - | 37.9 | 31.5 | 43.9 |
TruthfulQA | 44.5 | 33.1 | - | 39.4 | 39.7 | 45.9 |
C-Eval | 23.4 | 28.0 | 51.1 | 59.7 | 58.2 | 70.6 |
CMMLU | 24.2 | - | 51.1 | 57.8 | 55.1 | 70.3 |
This document introduces the Qwen2 series, an advancement from Qwen1.5, featuring pretrained and instruction-tuned models in five sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. These models exhibit enhanced performance, including state-of-the-art results on various benchmarks and significant improvements in coding and mathematics. Extended context lengths of up to 128K tokens are supported by the 7B and 72B instruction-tuned models.
Key advancements include training on data in 27 additional languages beyond English and Chinese, implementing Group Query Attention (GQA) across all model sizes for faster inference and reduced memory usage, and improved handling of code-switching in multilingual contexts. The Qwen2-72B model, in particular, outperforms the previous-generation Qwen1.5-110B and other leading models in natural language understanding, knowledge acquisition, coding proficiency, and multilingual capabilities.
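To make the GQA point concrete, here is a minimal sketch of grouped query attention: several query heads share one key/value head, which shrinks the KV cache and speeds up decoding. The head counts and shapes below are illustrative assumptions, not Qwen2's actual configuration, and the sketch assumes PyTorch is available.

```python
# Minimal Grouped Query Attention (GQA) sketch: 16 query heads share 4 KV heads.
# Illustrative only; head counts are NOT Qwen2's real configuration.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_q_heads // n_kv_heads          # query heads per shared KV head
    k = k.repeat_interleave(group_size, dim=1)    # expand KV heads to match query heads
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

batch, seq, head_dim = 1, 8, 64
q = torch.randn(batch, 16, seq, head_dim)  # 16 query heads
k = torch.randn(batch, 4, seq, head_dim)   # 4 shared key/value heads (cached)
v = torch.randn(batch, 4, seq, head_dim)
out = grouped_query_attention(q, k, v)     # shape: (1, 16, 8, 64)
```

The memory saving comes from caching only the 4 KV heads instead of 16 during generation; the query side is unchanged.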
Safety assessments show that Qwen2-72B-Instruct is comparable to GPT-4 and superior to Mixtral-8x22B in managing multilingual unsafe queries. The models have been open-sourced on Hugging Face and ModelScope under the Apache 2.0 license (except Qwen2-72B, which retains the Qianwen License), promoting broader application and commercial use. Future developments will include larger Qwen2 models and extensions to multimodal capabilities.
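Since the checkpoints are published on Hugging Face, one common way to try an instruction-tuned variant is through the transformers library. The sketch below is a hedged example: the repository name Qwen/Qwen2-7B-Instruct, the prompt, and the generation settings are assumptions for illustration, not details taken from this post, and it assumes transformers and accelerate are installed.

```python
# Hypothetical usage sketch: load an open-sourced Qwen2 instruction-tuned
# checkpoint from Hugging Face and run a single chat turn.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map needs accelerate
)

messages = [{"role": "user", "content": "Explain grouped query attention in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```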
Introduction
After months of effort, we are pleased to announce the evolution from Qwen1.5 to Qwen2. This time, we bring to you:
Model Information
Multilingual Capabilities
Performance
Highlights
Base Language Model Evaluation
The evaluation of base models focuses on performance in natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, and multilingual capability, using a wide range of datasets.
Instruction-tuned Model Evaluation
We compare Qwen2 instruction-tuned models with other recent LLMs on several cross-lingual benchmarks and through human evaluation. The results demonstrate the strong multilingual capabilities of Qwen2 instruction-tuned models.