
MinWoo(Daniel) Park | Tech Blog


Model | Qwen2

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-06-09

Qwen2

The Qwen2 technical report has not been published yet.



TL;DR


Replaced by the performance tables below.

The datasets used for evaluation include the following (a short sketch of how a k-shot prompt is assembled follows the list):

  • English Tasks: MMLU (5-shot), MMLU-Pro (5-shot), GPQA (5-shot), Theorem QA (5-shot), BBH (3-shot), HellaSwag (10-shot), Winogrande (5-shot), TruthfulQA (0-shot), ARC-C (25-shot)
  • Coding Tasks: EvalPlus (0-shot) (HumanEval, MBPP, HumanEval+, MBPP+), MultiPL-E (0-shot) (Python, C++, JAVA, PHP, TypeScript, C#, Bash, JavaScript)
  • Math Tasks: GSM8K (4-shot), MATH (4-shot)
  • Chinese Tasks: C-Eval(5-shot), CMMLU (5-shot)
  • Multilingual Tasks: Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), Multi-Translation (Flores-101 5-shot)
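
To make the shot counts above concrete, here is a minimal sketch of how a k-shot prompt is typically assembled: k solved exemplars are prepended to the unsolved test question. The exemplar questions and the `build_few_shot_prompt` helper are illustrative only and are not part of any official evaluation harness.

```python
# Minimal k-shot prompt construction sketch (illustrative, not the official harness).

EXEMPLARS = [  # hypothetical solved examples standing in for real benchmark items
    {"question": "What is 2 + 2?",
     "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Which planet is known as the Red Planet?",
     "choices": ["Venus", "Mars", "Jupiter", "Saturn"], "answer": "B"},
]

def format_example(ex, include_answer=True):
    letters = "ABCD"
    lines = [ex["question"]]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(ex["choices"])]
    lines.append(f"Answer: {ex['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(test_example, k=5):
    """Prepend up to k solved exemplars before the unsolved test question."""
    shots = [format_example(ex) for ex in EXEMPLARS[:k]]
    shots.append(format_example(test_example, include_answer=False))
    return "\n\n".join(shots)

if __name__ == "__main__":
    test = {"question": "Which gas do plants absorb during photosynthesis?",
            "choices": ["Oxygen", "Nitrogen", "Carbon dioxide", "Helium"],
            "answer": "C"}
    print(build_few_shot_prompt(test, k=2))  # 2-shot here; the benchmarks above use 0-25 shots
```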

Qwen Models

| Datasets | Qwen1.5-0.5B-Chat | Qwen2-0.5B-Instruct | Qwen1.5-1.8B-Chat | Qwen2-1.5B-Instruct | Qwen1.5-7B-Chat | Qwen2-7B-Instruct | Llama-3-70B-Instruct | Qwen1.5-72B-Chat | Qwen2-72B-Instruct | Qwen2-57B-A14B-Instruct | Mixtral-8x7B-Instruct-v0.1 | Yi-1.5-34B-Chat | Qwen1.5-32B-Chat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # Non-Emb Params | 0.5B | 0.5B | 1.3B | 1.3B | 7.7B | 7.6B | - | 72B | 72B | 14B | 12B | 34B | 32B |
| English | | | | | | | | | | | | | |
| MMLU | 35.0 (70.0) | 37.9 (75.8) | 43.7 (33.6) | 52.4 (40.3) | 59.5 (7.7) | 70.5 (9.3) | 82.0 | 75.6 (1.05) | 82.3 (1.14) | 75.4 (5.39) | 71.4 (5.95) | 76.8 (2.26) | 74.8 (2.34) |
| MMLU-Pro | - | - | - | - | 29.1 (3.78) | 44.1 (5.80) | 56.2 | 51.7 (0.72) | 64.4 (0.89) | 52.8 (3.77) | 43.3 (3.61) | 52.3 (1.54) | 46.4 (1.45) |
| GPQA | - | - | - | - | 27.8 (3.61) | 25.3 (3.33) | 41.9 | 39.4 (0.55) | 42.4 (0.59) | 34.3 (2.45) | - | - | 30.8 (0.96) |
| Theorem QA | - | - | - | - | 14.1 (1.83) | 25.3 (3.33) | 42.5 | 28.8 (0.40) | 44.4 (0.62) | 33.1 (2.37) | - | - | 30.9 (0.96) |
| MT-Bench | - | - | - | - | 7.60 (0.99) | 8.41 (1.11) | 8.95 | 8.61 (0.12) | 9.12 (0.13) | 8.55 (0.61) | 8.30 (0.69) | 8.50 (0.25) | 8.30 (0.26) |
| Arena-Hard | - | - | - | - | - | - | - | 36.1 (0.50) | 48.1 (0.67) | - | - | - | - |
| Coding | | | | | | | | | | | | | |
| HumanEval | 9.1 (18.2) | 17.1 (34.2) | 25.0 (19.2) | 37.8 (29.1) | 46.3 (6.01) | 79.9 (10.5) | 81.7 | 71.3 (0.99) | 86.0 (1.19) | 79.9 (5.71) | 45.1 (3.76) | 75.2 (2.21) | 68.3 (2.13) |
| MBPP | - | - | - | - | 48.9 (6.35) | 67.2 (8.84) | 82.3 | 71.9 (1.00) | 80.2 (1.11) | 70.9 (5.07) | 59.5 (4.96) | 74.6 (2.19) | 67.9 (2.12) |
| MultiPL-E | - | - | - | - | 27.2 (3.53) | 59.1 (7.78) | 63.4 | 48.1 (0.67) | 69.2 (0.96) | 66.4 (4.74) | - | - | 50.7 (1.58) |
| EvalPlus | - | - | - | - | 44.8 (5.82) | 70.3 (9.24) | 75.2 | 66.9 (0.93) | 79.0 (1.10) | 71.6 (5.11) | 48.5 (4.04) | - | 63.6 (1.99) |
| LiveCodeBench | - | - | - | - | 6.0 (0.78) | 26.6 (3.50) | 29.3 | 17.9 (0.25) | 35.7 (0.50) | 25.5 (1.82) | 12.3 (1.03) | - | 15.2 (0.47) |
| Mathematics | | | | | | | | | | | | | |
| GSM8K | 11.3 (22.6) | 40.1 (80.2) | 35.3 (27.2) | 61.6 (47.4) | 60.3 (7.8) | 82.3 (10.8) | 93.0 | 82.7 (1.15) | 91.1 (1.27) | 79.6 (5.69) | 65.7 (5.47) | 90.2 (2.65) | 83.6 (2.61) |
| MATH | - | - | - | - | 23.2 (3.01) | 49.6 (6.53) | 50.4 | 42.5 (0.59) | 59.7 (0.83) | 49.1 (3.51) | 30.7 (2.56) | 50.1 (1.47) | 42.4 (1.33) |
| Chinese | | | | | | | | | | | | | |
| C-Eval | 37.2 (74.4) | 45.2 (90.4) | 55.3 (42.5) | 63.8 (49.1) | 67.3 (8.7) | 77.2 (10.2) | 61.6 | 76.1 (1.05) | 83.8 (1.16) | 80.5 (5.75) | - | - | 76.7 (2.40) |
| AlignBench | - | - | - | - | 6.20 (0.80) | 7.21 (0.95) | 7.42 | 7.28 (0.10) | 8.27 (0.11) | 7.36 (0.53) | 5.70 (0.47) | 7.20 (0.21) | 7.19 (0.22) |
| Multilingual | | | | | | | | | | | | | |
| IFEval (Prompt Strict-Acc.) | 14.6 (29.2) | 20.0 (40.0) | 16.8 (12.9) | 29.0 (22.3) | - | - | 77.3 | 55.8 (0.78) | 77.6 (1.08) | - | - | - | - |

(Blog author's note) The numbers in parentheses next to each score represent the benchmark score divided by the parameter size of each model.
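
As a quick check of that footnote, the snippet below reproduces a few of the parenthesized values from the MMLU row using the parameter sizes listed in the table; all numbers are copied from the table above.

```python
# Reproduce a few parenthesized score-per-parameter figures from the table above.
mmlu = {
    "Qwen2-7B-Instruct": (70.5, 7.6),        # (MMLU score, parameter size in B)
    "Qwen2-72B-Instruct": (82.3, 72.0),
    "Qwen2-57B-A14B-Instruct": (75.4, 14.0),
}

for name, (score, params_b) in mmlu.items():
    print(f"{name}: {score} / {params_b} = {score / params_b:.2f}")
# Prints roughly 9.28, 1.14, and 5.39, matching the (9.3), (1.14), and (5.39) in the table.
```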

Qwen2-7B-Instruct

| Datasets | Llama-3-8B-Instruct | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen1.5-7B-Chat | Qwen2-7B-Instruct |
|---|---|---|---|---|---|
| English | | | | | |
| MMLU | 68.4 | 69.5 | 72.4 | 59.5 | 70.5 |
| MMLU-Pro | 41.0 | - | - | 29.1 | 44.1 |
| GPQA | 34.2 | - | - | 27.8 | 25.3 |
| Theorem QA | 23.0 | - | - | 14.1 | 25.3 |
| MT-Bench | 8.05 | 8.20 | 8.35 | 7.60 | 8.41 |
| Coding | | | | | |
| HumanEval | 62.2 | 66.5 | 71.8 | 46.3 | 79.9 |
| MBPP | 67.9 | - | - | 48.9 | 67.2 |
| MultiPL-E | 48.5 | - | - | 27.2 | 59.1 |
| EvalPlus | 60.9 | - | - | 44.8 | 70.3 |
| LiveCodeBench | 17.3 | - | - | 6.0 | 26.6 |
| Mathematics | | | | | |
| GSM8K | 79.6 | 84.8 | 79.6 | 60.3 | 82.3 |
| MATH | 30.0 | 47.7 | 50.6 | 23.2 | 49.6 |
| Chinese | | | | | |
| C-Eval | 45.9 | - | 75.6 | 67.3 | 77.2 |
| AlignBench | 6.20 | 6.90 | 7.01 | 6.20 | 7.21 |

Qwen2-0.5B-Instruct & Qwen2-1.5B-Instruct

| Datasets | Qwen1.5-0.5B-Chat | Qwen2-0.5B-Instruct | Qwen1.5-1.8B-Chat | Qwen2-1.5B-Instruct |
|---|---|---|---|---|
| MMLU | 35.0 | 37.9 | 43.7 | 52.4 |
| HumanEval | 9.1 | 17.1 | 25.0 | 37.8 |
| GSM8K | 11.3 | 40.1 | 35.3 | 61.6 |
| C-Eval | 37.2 | 45.2 | 55.3 | 63.8 |
| IFEval (Prompt Strict-Acc.) | 14.6 | 20.0 | 16.8 | 29.0 |

Qwen2-7B

| Datasets | Mistral-7B | Gemma-7B | Llama-3-8B | Qwen1.5-7B | Qwen2-7B |
|---|---|---|---|---|---|
| # Params | 7.2B | 8.5B | 8.0B | 7.7B | 7.6B |
| # Non-emb Params | 7.0B | 7.8B | 7.0B | 6.5B | 6.5B |
| English | | | | | |
| MMLU | 64.2 | 64.6 | 66.6 | 61.0 | 70.3 |
| MMLU-Pro | 30.9 | 33.7 | 35.4 | 29.9 | 40.0 |
| GPQA | 24.7 | 25.7 | 25.8 | 26.7 | 31.8 |
| Theorem QA | 19.2 | 21.5 | 22.1 | 14.2 | 31.1 |
| BBH | 56.1 | 55.1 | 57.7 | 40.2 | 62.6 |
| HellaSwag | 83.2 | 82.2 | 82.1 | 78.5 | 80.7 |
| Winogrande | 78.4 | 79.0 | 77.4 | 71.3 | 77.0 |
| ARC-C | 60.0 | 61.1 | 59.3 | 54.2 | 60.6 |
| TruthfulQA | 42.2 | 44.8 | 44.0 | 51.1 | 54.2 |
| Coding | | | | | |
| HumanEval | 29.3 | 37.2 | 33.5 | 36.0 | 51.2 |
| MBPP | 51.1 | 50.6 | 53.9 | 51.6 | 65.9 |
| EvalPlus | 36.4 | 39.6 | 40.3 | 40.0 | 54.2 |
| MultiPL-E | 29.4 | 29.7 | 22.6 | 28.1 | 46.3 |
| Mathematics | | | | | |
| GSM8K | 52.2 | 46.4 | 56.0 | 62.5 | 79.9 |
| MATH | 13.1 | 24.3 | 20.5 | 20.3 | 44.2 |
| Chinese | | | | | |
| C-Eval | 47.4 | 43.6 | 49.5 | 74.1 | 83.2 |
| CMMLU | - | - | 50.8 | 73.1 | 83.9 |
| Multilingual | | | | | |
| Multi-Exam | 47.1 | 42.7 | 52.3 | 47.7 | 59.2 |
| Multi-Understanding | 63.3 | 58.3 | 68.6 | 67.6 | 72.0 |
| Multi-Mathematics | 26.3 | 39.1 | 36.3 | 37.3 | 57.5 |
| Multi-Translation | 23.3 | 31.2 | 31.9 | 28.4 | 31.5 |

Qwen2-0.5B & Qwen2-1.5B

| Datasets | Phi-2 | Gemma-2B | MiniCPM | Qwen1.5-1.8B | Qwen2-0.5B | Qwen2-1.5B |
|---|---|---|---|---|---|---|
| # Non-Emb Params | 2.5B | 2.0B | 2.4B | 1.3B | 0.35B | 1.3B |
| MMLU | 52.7 | 42.3 | 53.5 | 46.8 | 45.4 | 56.5 |
| MMLU-Pro | - | 15.9 | - | - | 14.7 | 21.8 |
| Theorem QA | - | - | - | - | 8.9 | 15.0 |
| HumanEval | 47.6 | 22.0 | 50.0 | 20.1 | 22.0 | 31.1 |
| MBPP | 55.0 | 29.2 | 47.3 | 18.0 | 22.0 | 37.4 |
| GSM8K | 57.2 | 17.7 | 53.8 | 38.4 | 36.5 | 58.5 |
| MATH | 3.5 | 11.8 | 10.2 | 10.1 | 10.7 | 21.7 |
| BBH | 43.4 | 35.2 | 36.9 | 24.2 | 28.4 | 37.2 |
| HellaSwag | 73.1 | 71.4 | 68.3 | 61.4 | 49.3 | 66.6 |
| Winogrande | 74.4 | 66.8 | - | 60.3 | 56.8 | 66.2 |
| ARC-C | 61.1 | 48.5 | - | 37.9 | 31.5 | 43.9 |
| TruthfulQA | 44.5 | 33.1 | - | 39.4 | 39.7 | 45.9 |
| C-Eval | 23.4 | 28.0 | 51.1 | 59.7 | 58.2 | 70.6 |
| CMMLU | 24.2 | - | 51.1 | 57.8 | 55.1 | 70.3 |

This document introduces the Qwen2 series, an advancement from Qwen1.5, featuring pretrained and instruction-tuned models in five sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. These models exhibit enhanced performance, including state-of-the-art results in various benchmark evaluations and significant improvements in coding and mathematics. They also support extended context lengths up to 128K tokens.

Key advancements include training on data in 27 additional languages beyond English and Chinese, implementing Group Query Attention (GQA) across all model sizes for faster inference and reduced memory usage, and improved handling of code-switching in multilingual contexts. The Qwen2-72B model, in particular, outperforms its predecessor Qwen1.5-110B and other leading models in natural language understanding, knowledge acquisition, coding proficiency, and multilingual capabilities.

Safety assessments show that Qwen2-72B-Instruct is comparable to GPT-4 and superior to the Mixtral-8x22B model in managing multilingual unsafe queries. The models have been open-sourced on Hugging Face and ModelScope under the Apache 2.0 license (except Qwen2-72B, which retains the Qianwen License), promoting broader application and commercial use. Future developments will include larger Qwen2 models and extensions to multimodal capabilities.

Introduction

After months of effort, we are pleased to announce the evolution from Qwen1.5 to Qwen2. This time, we bring to you:

  • Pretrained and instruction-tuned models of 5 sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B.
  • Training on data in 27 additional languages besides English and Chinese.
  • State-of-the-art performance in numerous benchmark evaluations.
  • Significantly improved performance in coding and mathematics.
  • Extended context length support up to 128K tokens with Qwen2-7B-Instruct and Qwen2-72B-Instruct.

Model Information

  • All model sizes now apply Group Query Attention (GQA) for faster inference and lower memory usage (a minimal GQA sketch follows this list). The context length capabilities of the instruction-tuned models, assessed through the Needle in a Haystack task, extend up to 128K tokens for Qwen2-7B-Instruct and Qwen2-72B-Instruct.
  • The Qwen2 series includes base and instruction-tuned models of 5 sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B.
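
The following is a minimal sketch of grouped-query attention in PyTorch, intended only to illustrate the mechanism; the head counts and dimensions are made up and do not reflect the actual Qwen2 configurations.

```python
# Grouped-query attention (GQA) sketch: fewer key/value heads than query heads,
# with each KV head shared by a group of query heads. Dimensions are illustrative.
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 2, 16, 64
n_q_heads, n_kv_heads = 8, 2                 # every 4 query heads share one KV head
group = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # only n_kv_heads are cached
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V on the fly so each query head attends with its group's shared KV head.
k = k.repeat_interleave(group, dim=1)        # -> (batch, n_q_heads, seq_len, head_dim)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                             # torch.Size([2, 8, 16, 64])
```

The memory saving comes from the KV cache holding only the smaller number of key/value heads; the expansion to match the query heads happens at attention time.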

Multilingual Capabilities

  • Significant efforts were directed towards augmenting both the volume and quality of pretraining and instruction-tuning datasets across a diverse linguistic spectrum. We explicitly highlight the inclusion of 27 additional languages:
    • Western Europe: German, French, Spanish, Portuguese, Italian, Dutch
    • Eastern & Central Europe: Russian, Czech, Polish
    • Middle East: Arabic, Persian, Hebrew, Turkish
    • Eastern Asia: Japanese, Korean
    • South-Eastern Asia: Vietnamese, Thai, Indonesian, Malay, Lao, Burmese, Cebuano, Khmer, Tagalog
    • Southern Asia: Hindi, Bengali, Urdu
  • Evaluations confirm enhanced proficiency in handling code-switching across languages.

Performance

  • Comparative assessments reveal substantial enhancements in performance for large-scale models (70B+ parameters) relative to Qwen1.5. The Qwen2-72B model exhibits superior performance in natural language understanding, knowledge acquisition, coding proficiency, mathematical skills, and multilingual abilities. Notably, it surpasses the performance of Qwen1.5-110B despite having fewer parameters.
  • Our post-training phase is designed to enhance the model’s intelligence, bringing it closer to human capabilities. This phase employs various automated alignment strategies and training methods to improve coding, mathematics, reasoning, instruction following, and multilingual understanding.

Highlights

  • Coding & Mathematics: Significant improvements have been made in Qwen2-72B-Instruct for various programming languages and solving mathematical problems using extensive and high-quality datasets.
  • Long Context Understanding: All instruction-tuned models have been trained on 32K-token contexts and extrapolated to longer contexts using techniques such as YaRN or Dual Chunk Attention. Qwen2-72B-Instruct can handle information-extraction tasks within a 128K-token context.
  • Safety and Responsibility: Evaluation of the model’s safety reveals that Qwen2-72B-Instruct performs comparably to GPT-4 and significantly outperforms the Mixtral-8x22B model in handling multilingual unsafe queries.
  • Developing with Qwen2: All models have been released on Hugging Face and ModelScope. For detailed usage and more information, refer to the model cards and our official documentation; a minimal loading sketch follows this list.
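
As a minimal illustration of the "Developing with Qwen2" point above, the sketch below loads the Qwen2-7B-Instruct checkpoint from Hugging Face with transformers and runs a single chat turn. The generation settings are illustrative; consult the model card for the recommended configuration.

```python
# Minimal sketch: one chat turn with Qwen2-7B-Instruct via transformers.
# Assumes transformers, torch, and accelerate are installed and enough GPU/CPU memory is available.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Strip the prompt tokens and decode only the newly generated continuation.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```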

Base Language Model Evaluation

The evaluation of base models focuses on performance in natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, and multilingual capability using various datasets.

Instruction-tuned Model Evaluation

We compare Qwen2 instruction-tuned models with other recent LLMs on several cross-lingual benchmarks and by human evaluation. The results demonstrate the strong multilingual capabilities of Qwen2 instruction-tuned models.
