
MinWoo(Daniel) Park | Tech Blog


Model | Qwen2

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-06-09

Qwen2

The Qwen2 technical report has not been published yet.



TL;DR


Replaced by the performance tables below.

The datasets used for evaluation include the following (a short sketch of how a k-shot prompt is assembled follows the list):

  • English Tasks: MMLU (5-shot), MMLU-Pro (5-shot), GPQA (5-shot), Theorem QA (5-shot), BBH (3-shot), HellaSwag (10-shot), Winogrande (5-shot), TruthfulQA (0-shot), ARC-C (25-shot)
  • Coding Tasks: EvalPlus (0-shot) (HumanEval, MBPP, HumanEval+, MBPP+), MultiPL-E (0-shot) (Python, C++, JAVA, PHP, TypeScript, C#, Bash, JavaScript)
  • Math Tasks: GSM8K (4-shot), MATH (4-shot)
  • Chinese Tasks: C-Eval(5-shot), CMMLU (5-shot)
  • Multilingual Tasks: Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), Multi-Translation (Flores-101 5-shot)
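
To make the shot counts above concrete, here is a minimal sketch of how a k-shot prompt is typically assembled: k solved exemplars are prepended to the unsolved test question. The exemplar questions and the `build_few_shot_prompt` helper are illustrative only and are not part of any official evaluation harness.

```python
# Minimal k-shot prompt construction sketch (illustrative, not the official harness).

EXEMPLARS = [  # hypothetical solved examples standing in for real benchmark items
    {"question": "What is 2 + 2?",
     "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Which planet is known as the Red Planet?",
     "choices": ["Venus", "Mars", "Jupiter", "Saturn"], "answer": "B"},
]

def format_example(ex, include_answer=True):
    letters = "ABCD"
    lines = [ex["question"]]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(ex["choices"])]
    lines.append(f"Answer: {ex['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(test_example, k=5):
    """Prepend up to k solved exemplars before the unsolved test question."""
    shots = [format_example(ex) for ex in EXEMPLARS[:k]]
    shots.append(format_example(test_example, include_answer=False))
    return "\n\n".join(shots)

if __name__ == "__main__":
    test = {"question": "Which gas do plants absorb during photosynthesis?",
            "choices": ["Oxygen", "Nitrogen", "Carbon dioxide", "Helium"],
            "answer": "C"}
    print(build_few_shot_prompt(test, k=2))  # 2-shot here; the benchmarks above use 0-25 shots
```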

Qwen Models

| Datasets | Qwen1.5-0.5B-Chat | Qwen2-0.5B-Instruct | Qwen1.5-1.8B-Chat | Qwen2-1.5B-Instruct | Qwen1.5-7B-Chat | Qwen2-7B-Instruct | Llama-3-70B-Instruct | Qwen1.5-72B-Chat | Qwen2-72B-Instruct | Qwen2-57B-A14B-Instruct | Mixtral-8x7B-Instruct-v0.1 | Yi-1.5-34B-Chat | Qwen1.5-32B-Chat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # Non-Emb Params | 0.5B | 0.5B | 1.3B | 1.3B | 7.7B | 7.6B | - | 72B | 72B | 14B | 12B | 34B | 32B |
| English | | | | | | | | | | | | | |
| MMLU | 35.0 (70.0) | 37.9 (75.8) | 43.7 (33.6) | 52.4 (40.3) | 59.5 (7.7) | 70.5 (9.3) | 82.0 | 75.6 (1.05) | 82.3 (1.14) | 75.4 (5.39) | 71.4 (5.95) | 76.8 (2.26) | 74.8 (2.34) |
| MMLU-Pro | - | - | - | - | 29.1 (3.78) | 44.1 (5.80) | 56.2 | 51.7 (0.72) | 64.4 (0.89) | 52.8 (3.77) | 43.3 (3.61) | 52.3 (1.54) | 46.4 (1.45) |
| GPQA | - | - | - | - | 27.8 (3.61) | 25.3 (3.33) | 41.9 | 39.4 (0.55) | 42.4 (0.59) | 34.3 (2.45) | - | - | 30.8 (0.96) |
| Theorem QA | - | - | - | - | 14.1 (1.83) | 25.3 (3.33) | 42.5 | 28.8 (0.40) | 44.4 (0.62) | 33.1 (2.37) | - | - | 30.9 (0.96) |
| MT-Bench | - | - | - | - | 7.60 (0.99) | 8.41 (1.11) | 8.95 | 8.61 (0.12) | 9.12 (0.13) | 8.55 (0.61) | 8.30 (0.69) | 8.50 (0.25) | 8.30 (0.26) |
| Arena-Hard | - | - | - | - | - | - | - | 36.1 (0.50) | 48.1 (0.67) | - | - | - | - |
| Coding | | | | | | | | | | | | | |
| HumanEval | 9.1 (18.2) | 17.1 (34.2) | 25.0 (19.2) | 37.8 (29.1) | 46.3 (6.01) | 79.9 (10.5) | 81.7 | 71.3 (0.99) | 86.0 (1.19) | 79.9 (5.71) | 45.1 (3.76) | 75.2 (2.21) | 68.3 (2.13) |
| MBPP | - | - | - | - | 48.9 (6.35) | 67.2 (8.84) | 82.3 | 71.9 (1.00) | 80.2 (1.11) | 70.9 (5.07) | 59.5 (4.96) | 74.6 (2.19) | 67.9 (2.12) |
| MultiPL-E | - | - | - | - | 27.2 (3.53) | 59.1 (7.78) | 63.4 | 48.1 (0.67) | 69.2 (0.96) | 66.4 (4.74) | - | - | 50.7 (1.58) |
| EvalPlus | - | - | - | - | 44.8 (5.82) | 70.3 (9.24) | 75.2 | 66.9 (0.93) | 79.0 (1.10) | 71.6 (5.11) | 48.5 (4.04) | - | 63.6 (1.99) |
| LiveCodeBench | - | - | - | - | 6.0 (0.78) | 26.6 (3.50) | 29.3 | 17.9 (0.25) | 35.7 (0.50) | 25.5 (1.82) | 12.3 (1.03) | - | 15.2 (0.47) |
| Mathematics | | | | | | | | | | | | | |
| GSM8K | 11.3 (22.6) | 40.1 (80.2) | 35.3 (27.2) | 61.6 (47.4) | 60.3 (7.8) | 82.3 (10.8) | 93.0 | 82.7 (1.15) | 91.1 (1.27) | 79.6 (5.69) | 65.7 (5.47) | 90.2 (2.65) | 83.6 (2.61) |
| MATH | - | - | - | - | 23.2 (3.01) | 49.6 (6.53) | 50.4 | 42.5 (0.59) | 59.7 (0.83) | 49.1 (3.51) | 30.7 (2.56) | 50.1 (1.47) | 42.4 (1.33) |
| Chinese | | | | | | | | | | | | | |
| C-Eval | 37.2 (74.4) | 45.2 (90.4) | 55.3 (42.5) | 63.8 (49.1) | 67.3 (8.7) | 77.2 (10.2) | 61.6 | 76.1 (1.05) | 83.8 (1.16) | 80.5 (5.75) | - | - | 76.7 (2.40) |
| AlignBench | - | - | - | - | 6.20 (0.80) | 7.21 (0.95) | 7.42 | 7.28 (0.10) | 8.27 (0.11) | 7.36 (0.53) | 5.70 (0.47) | 7.20 (0.21) | 7.19 (0.22) |
| Multilingual | | | | | | | | | | | | | |
| IFEval (Prompt Strict-Acc.) | 14.6 (29.2) | 20.0 (40.0) | 16.8 (12.9) | 29.0 (22.3) | - | - | 77.3 | 55.8 (0.78) | 77.6 (1.08) | - | - | - | - |

(Blog author's note) The numbers in parentheses next to each score represent the benchmark score divided by the parameter size of each model.
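
As a quick check of that footnote, the snippet below reproduces a few of the parenthesized values from the MMLU row using the parameter sizes listed in the table; all numbers are copied from the table above.

```python
# Reproduce a few parenthesized score-per-parameter figures from the table above.
mmlu = {
    "Qwen2-7B-Instruct": (70.5, 7.6),        # (MMLU score, parameter size in B)
    "Qwen2-72B-Instruct": (82.3, 72.0),
    "Qwen2-57B-A14B-Instruct": (75.4, 14.0),
}

for name, (score, params_b) in mmlu.items():
    print(f"{name}: {score} / {params_b} = {score / params_b:.2f}")
# Prints roughly 9.28, 1.14, and 5.39, matching the (9.3), (1.14), and (5.39) in the table.
```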

Qwen2-7B-Instruct

| Datasets | Llama-3-8B-Instruct | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen1.5-7B-Chat | Qwen2-7B-Instruct |
|---|---|---|---|---|---|
| English | | | | | |
| MMLU | 68.4 | 69.5 | 72.4 | 59.5 | 70.5 |
| MMLU-Pro | 41.0 | - | - | 29.1 | 44.1 |
| GPQA | 34.2 | - | - | 27.8 | 25.3 |
| Theorem QA | 23.0 | - | - | 14.1 | 25.3 |
| MT-Bench | 8.05 | 8.20 | 8.35 | 7.60 | 8.41 |
| Coding | | | | | |
| HumanEval | 62.2 | 66.5 | 71.8 | 46.3 | 79.9 |
| MBPP | 67.9 | - | - | 48.9 | 67.2 |
| MultiPL-E | 48.5 | - | - | 27.2 | 59.1 |
| EvalPlus | 60.9 | - | - | 44.8 | 70.3 |
| LiveCodeBench | 17.3 | - | - | 6.0 | 26.6 |
| Mathematics | | | | | |
| GSM8K | 79.6 | 84.8 | 79.6 | 60.3 | 82.3 |
| MATH | 30.0 | 47.7 | 50.6 | 23.2 | 49.6 |
| Chinese | | | | | |
| C-Eval | 45.9 | - | 75.6 | 67.3 | 77.2 |
| AlignBench | 6.20 | 6.90 | 7.01 | 6.20 | 7.21 |

Qwen2-0.5B-Instruct & Qwen2-1.5B-Instruct

| Datasets | Qwen1.5-0.5B-Chat | Qwen2-0.5B-Instruct | Qwen1.5-1.8B-Chat | Qwen2-1.5B-Instruct |
|---|---|---|---|---|
| MMLU | 35.0 | 37.9 | 43.7 | 52.4 |
| HumanEval | 9.1 | 17.1 | 25.0 | 37.8 |
| GSM8K | 11.3 | 40.1 | 35.3 | 61.6 |
| C-Eval | 37.2 | 45.2 | 55.3 | 63.8 |
| IFEval (Prompt Strict-Acc.) | 14.6 | 20.0 | 16.8 | 29.0 |

Qwen2-7B

| Datasets | Mistral-7B | Gemma-7B | Llama-3-8B | Qwen1.5-7B | Qwen2-7B |
|---|---|---|---|---|---|
| # Params | 7.2B | 8.5B | 8.0B | 7.7B | 7.6B |
| # Non-emb Params | 7.0B | 7.8B | 7.0B | 6.5B | 6.5B |
| English | | | | | |
| MMLU | 64.2 | 64.6 | 66.6 | 61.0 | 70.3 |
| MMLU-Pro | 30.9 | 33.7 | 35.4 | 29.9 | 40.0 |
| GPQA | 24.7 | 25.7 | 25.8 | 26.7 | 31.8 |
| Theorem QA | 19.2 | 21.5 | 22.1 | 14.2 | 31.1 |
| BBH | 56.1 | 55.1 | 57.7 | 40.2 | 62.6 |
| HellaSwag | 83.2 | 82.2 | 82.1 | 78.5 | 80.7 |
| Winogrande | 78.4 | 79.0 | 77.4 | 71.3 | 77.0 |
| ARC-C | 60.0 | 61.1 | 59.3 | 54.2 | 60.6 |
| TruthfulQA | 42.2 | 44.8 | 44.0 | 51.1 | 54.2 |
| Coding | | | | | |
| HumanEval | 29.3 | 37.2 | 33.5 | 36.0 | 51.2 |
| MBPP | 51.1 | 50.6 | 53.9 | 51.6 | 65.9 |
| EvalPlus | 36.4 | 39.6 | 40.3 | 40.0 | 54.2 |
| MultiPL-E | 29.4 | 29.7 | 22.6 | 28.1 | 46.3 |
| Mathematics | | | | | |
| GSM8K | 52.2 | 46.4 | 56.0 | 62.5 | 79.9 |
| MATH | 13.1 | 24.3 | 20.5 | 20.3 | 44.2 |
| Chinese | | | | | |
| C-Eval | 47.4 | 43.6 | 49.5 | 74.1 | 83.2 |
| CMMLU | - | - | 50.8 | 73.1 | 83.9 |
| Multilingual | | | | | |
| Multi-Exam | 47.1 | 42.7 | 52.3 | 47.7 | 59.2 |
| Multi-Understanding | 63.3 | 58.3 | 68.6 | 67.6 | 72.0 |
| Multi-Mathematics | 26.3 | 39.1 | 36.3 | 37.3 | 57.5 |
| Multi-Translation | 23.3 | 31.2 | 31.9 | 28.4 | 31.5 |

Qwen2-0.5B & Qwen2-1.5B

| Datasets | Phi-2 | Gemma-2B | MiniCPM | Qwen1.5-1.8B | Qwen2-0.5B | Qwen2-1.5B |
|---|---|---|---|---|---|---|
| # Non-Emb Params | 2.5B | 2.0B | 2.4B | 1.3B | 0.35B | 1.3B |
| MMLU | 52.7 | 42.3 | 53.5 | 46.8 | 45.4 | 56.5 |
| MMLU-Pro | - | 15.9 | - | - | 14.7 | 21.8 |
| Theorem QA | - | - | - | - | 8.9 | 15.0 |
| HumanEval | 47.6 | 22.0 | 50.0 | 20.1 | 22.0 | 31.1 |
| MBPP | 55.0 | 29.2 | 47.3 | 18.0 | 22.0 | 37.4 |
| GSM8K | 57.2 | 17.7 | 53.8 | 38.4 | 36.5 | 58.5 |
| MATH | 3.5 | 11.8 | 10.2 | 10.1 | 10.7 | 21.7 |
| BBH | 43.4 | 35.2 | 36.9 | 24.2 | 28.4 | 37.2 |
| HellaSwag | 73.1 | 71.4 | 68.3 | 61.4 | 49.3 | 66.6 |
| Winogrande | 74.4 | 66.8 | - | 60.3 | 56.8 | 66.2 |
| ARC-C | 61.1 | 48.5 | - | 37.9 | 31.5 | 43.9 |
| TruthfulQA | 44.5 | 33.1 | - | 39.4 | 39.7 | 45.9 |
| C-Eval | 23.4 | 28.0 | 51.1 | 59.7 | 58.2 | 70.6 |
| CMMLU | 24.2 | - | 51.1 | 57.8 | 55.1 | 70.3 |

This document introduces the Qwen2 series, an advancement from Qwen1.5, featuring pretrained and instruction-tuned models in five sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. These models exhibit enhanced performance, including state-of-the-art results in various benchmark evaluations and significant improvements in coding and mathematics. They also support extended context lengths up to 128K tokens.

Key advancements include training on data in 27 additional languages beyond English and Chinese, implementing Group Query Attention (GQA) across all model sizes for faster inference and reduced memory usage, and improved handling of code-switching in multilingual contexts. The Qwen2-72B model, in particular, outperforms its predecessor Qwen1.5-110B and other leading models in natural language understanding, knowledge acquisition, coding proficiency, and multilingual capabilities.

Safety assessments show that Qwen2-72B-Instruct is comparable to GPT-4 and superior to the Mixtral-8x22B model in managing multilingual unsafe queries. The models have been open-sourced on Hugging Face and ModelScope under the Apache 2.0 license (except Qwen2-72B, which retains the Qianwen License), promoting broader application and commercial use. Future developments will include larger Qwen2 models and extensions to multimodal capabilities.

Introduction

After months of effort, we are pleased to announce the evolution from Qwen1.5 to Qwen2. This time, we bring to you:

  • Pretrained and instruction-tuned models of 5 sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B.
  • Training on data in 27 additional languages besides English and Chinese.
  • State-of-the-art performance in numerous benchmark evaluations.
  • Significantly improved performance in coding and mathematics.
  • Extended context length support up to 128K tokens with Qwen2-7B-Instruct and Qwen2-72B-Instruct.

Model Information

  • All model sizes now apply Group Query Attention (GQA) for faster inference and lower memory usage (a minimal GQA sketch follows this list). The context length capabilities of the instruction-tuned models, assessed through the Needle in a Haystack task, extend up to 128K tokens for Qwen2-7B-Instruct and Qwen2-72B-Instruct.
  • The Qwen2 series includes base and instruction-tuned models of 5 sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B.
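
The following is a minimal sketch of grouped-query attention in PyTorch, intended only to illustrate the mechanism; the head counts and dimensions are made up and do not reflect the actual Qwen2 configurations.

```python
# Grouped-query attention (GQA) sketch: fewer key/value heads than query heads,
# with each KV head shared by a group of query heads. Dimensions are illustrative.
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 2, 16, 64
n_q_heads, n_kv_heads = 8, 2                 # every 4 query heads share one KV head
group = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # only n_kv_heads are cached
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V on the fly so each query head attends with its group's shared KV head.
k = k.repeat_interleave(group, dim=1)        # -> (batch, n_q_heads, seq_len, head_dim)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                             # torch.Size([2, 8, 16, 64])
```

The memory saving comes from the KV cache holding only the smaller number of key/value heads; the expansion to match the query heads happens at attention time.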

Multilingual Capabilities

  • Significant efforts were directed towards augmenting both the volume and quality of pretraining and instruction-tuning datasets across a diverse linguistic spectrum. We explicitly highlight the inclusion of 27 additional languages:
    • Western Europe: German, French, Spanish, Portuguese, Italian, Dutch
    • Eastern & Central Europe: Russian, Czech, Polish
    • Middle East: Arabic, Persian, Hebrew, Turkish
    • Eastern Asia: Japanese, Korean
    • South-Eastern Asia: Vietnamese, Thai, Indonesian, Malay, Lao, Burmese, Cebuano, Khmer, Tagalog
    • Southern Asia: Hindi, Bengali, Urdu
  • Evaluations confirm enhanced proficiency in handling code-switching across languages.

Performance

  • Comparative assessments reveal substantial enhancements in performance for large-scale models (70B+ parameters) relative to Qwen1.5. The Qwen2-72B model exhibits superior performance in natural language understanding, knowledge acquisition, coding proficiency, mathematical skills, and multilingual abilities. Notably, it surpasses the performance of Qwen1.5-110B despite having fewer parameters.
  • Our post-training phase is designed to enhance the model’s intelligence, bringing it closer to human capabilities. This phase employs various automated alignment strategies and training methods to improve coding, mathematics, reasoning, instruction following, and multilingual understanding.

Highlights

  • Coding & Mathematics: Significant improvements have been made in Qwen2-72B-Instruct for various programming languages and solving mathematical problems using extensive and high-quality datasets.
  • Long Context Understanding: All instruction-tuned models have been trained on 32K-token contexts and extrapolated to longer contexts using techniques such as YaRN or Dual Chunk Attention. Qwen2-72B-Instruct can handle information-extraction tasks within a 128K-token context.
  • Safety and Responsibility: Evaluation of the model’s safety reveals that Qwen2-72B-Instruct performs comparably to GPT-4 and significantly outperforms the Mixtral-8x22B model in handling multilingual unsafe queries.
  • Developing with Qwen2: All models have been released on Hugging Face and ModelScope. For detailed usage and more information, refer to the model cards and our official documentation; a minimal loading sketch follows this list.
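
As a minimal illustration of the "Developing with Qwen2" point above, the sketch below loads the Qwen2-7B-Instruct checkpoint from Hugging Face with transformers and runs a single chat turn. The generation settings are illustrative; consult the model card for the recommended configuration.

```python
# Minimal sketch: one chat turn with Qwen2-7B-Instruct via transformers.
# Assumes transformers, torch, and accelerate are installed and enough GPU/CPU memory is available.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Strip the prompt tokens and decode only the newly generated continuation.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```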

Base Language Model Evaluation

The evaluation of base models focuses on performance in natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, and multilingual capability using various datasets.

Instruction-tuned Model Evaluation

We compare Qwen2 instruction-tuned models with other recent LLMs on several cross-lingual benchmarks and by human evaluation. The results demonstrate the strong multilingual capabilities of Qwen2 instruction-tuned models.
