
Model | Qwen 1

  • Related Project: private
  • Category: Paper Review
  • Date: 2023-08-20

Qwen Technical Report

  • url: https://arxiv.org/abs/2309.16609
  • pdf: https://arxiv.org/pdf/2309.16609
  • hugging face: https://huggingface.co/Qwen
  • github: https://github.com/QwenLM/Qwen-7B
  • abstract: Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.


TL;DR


  1. QWEN comprises models with a range of parameter counts: base language models, chat models, and models specialized for coding and mathematics.
  2. The models demonstrate improved performance across a wide variety of datasets and benchmarks.
  3. Performance is further optimized with a reward model aligned to human preference and reinforcement learning.

Introduction

The QWEN team developed the QWEN series to realize the potential of large language models (LLMs). The series consists of models with various parameter counts, each optimized for a different purpose. The paper presents QWEN's design, training methods, and evaluation results across a range of benchmarks.


Pre-training

Datasets and Benchmarks

The QWEN models were pretrained extensively on 3 trillion tokens of diverse text and code. The dataset consists of public web documents, encyclopedias, books, code, and more, and is multilingual, covering a wide range of types and domains.

Data Preprocessing and Tokenization

During preprocessing, text was extracted from HTML, a language identification tool determined the language, and normalization and deduplication were applied; in particular, MinHash and LSH algorithms were used to remove near-duplicates. Tokenization uses BPE (Byte Pair Encoding), with a final vocabulary of roughly 152,000 tokens.


Model Architecture

The QWEN models use a modified Transformer architecture; the main changes are as follows.

  • Embedding and output projection: the input embedding and output projection weights are untied (separate embeddings) rather than shared.
  • Positional embedding: RoPE (Rotary Positional Embedding).
  • Bias: biases are removed in most layers but added to the attention QKV layers.
  • Pre-normalization and RMSNorm: pre-normalization combined with RMSNorm.
  • Activation function: SwiGLU (Swish + GLU).


Training

QWEN is trained with standard autoregressive language modeling. The models were trained with a context length of 2048, using Flash Attention to improve computational efficiency and reduce memory usage. AdamW was used as the optimizer, and BFloat16 mixed precision was applied for training stability.


Supervised Fine-Tuning (SFT)

The QWEN-CHAT models were fine-tuned on human-style conversations. The training dataset was carefully curated and includes data related to safety concerns such as violence, bias, and pornography. The ChatML format is used to cleanly separate metadata (such as roles) from the conversational content of each turn.

Reinforcement Learning from Human Feedback (RLHF)

The QWEN models were further trained with reinforcement learning using a reward model aligned to human preference. The reward model is initialized from a pretrained language model and trained on comparison data. PPO (Proximal Policy Optimization) is then applied using a policy model, value model, reference model, and reward model.


1 INTRODUCTION

(General capabilities and potential of LLMs) Large language models (LLMs) have revolutionized the field of artificial intelligence (AI) by providing a powerful foundation for complex reasoning and problem-solving tasks. These models have the ability to compress vast knowledge into neural networks, making them incredibly versatile agents. With a chat interface, LLMs can perform tasks that were previously thought to be the exclusive domain of humans, especially those involving creativity and expertise. They can engage in natural language conversations with humans, answering questions, providing information, and even generating creative content such as stories, poems, and music. This has led to the development of a wide range of applications, from chatbots and virtual assistants to language translation and summarization tools.

LLMs are not just limited to language tasks. They can also function as a generalist agent, collaborating with external systems, tools, and models to achieve the objectives set by humans. For example, LLMs can understand multimodal instructions, execute code, use tools, and more. This opens up a whole new world of possibilities for AI applications, from autonomous vehicles and robotics to healthcare and finance. As these models continue to evolve and improve, we can expect to see even more innovative and exciting applications in the years to come. Whether it’s helping us solve complex problems, creating new forms of entertainment, or transforming the way we live and work, LLMs are poised to play a central role in shaping the future of AI.

Figure 1: Model Lineage of the Qwen Series.

(Released model lineup) We have pretrained the language models, namely QWEN, on massive datasets containing trillions of tokens. We then use SFT and RLHF to align QWEN to human preference and thus we have QWEN-CHAT and specifically its improved version QWEN-CHAT-RLHF. Additionally, we also develop specialized models for coding and mathematics, such as CODE-QWEN, CODE-QWEN-CHAT, and MATH-QWEN-CHAT based on QWEN with similar techniques. Note that we previously released the multimodal LLM, QWEN-VL and QWEN-VL-CHAT, which are also based on our QWEN base models.

  1. QWEN Models:
    • Pretraining: Language models, such as QWEN, are pretrained on massive datasets containing trillions of tokens.
    • Alignment: SFT (Supervised Fine-Tuning) and RLHF (Reinforcement Learning from Human Feedback) techniques are used to align QWEN to human preference.
  2. QWEN-CHAT Models:
    • Development: QWEN-CHAT is a variant of the QWEN model that is specifically tailored for chat-based applications.
    • Improvement: QWEN-CHAT-RLHF is an improved version of QWEN-CHAT that benefits from RLHF alignment techniques.
  3. Specialized Models:
    • Coding Models:
      • CODE-QWEN: A model developed for coding-related tasks, built on the QWEN base model.
      • CODE-QWEN-CHAT: A chat-specific variant of CODE-QWEN.
    • Mathematics Model:
      • MATH-QWEN-CHAT: A model specialized for mathematics, developed based on QWEN-CHAT.
  4. Multimodal Models:
    • QWEN-VL and QWEN-VL-CHAT: Multimodal LLMs based on QWEN models, designed to handle a variety of data types beyond text, including vision and language. QWEN-VL-CHAT is the chat-specific version of QWEN-VL.

Despite their impressive capabilities, LLMs are often criticized for their lack of reproducibility, steerability, and accessibility to service providers. In this work, we are pleased to present and release the initial version of our LLM series, QWEN. QWEN is a comprehensive language model series that encompasses distinct models with varying parameter counts. The model series include the base pretrained language models, chat models finetuned with human alignment techniques, i.e., supervised fine-tuning (SFT), reinforcement learning with human feedback (RLHF), etc., as well as specialized models in coding and math. The details are outlined below:

  1. The base language models, namely QWEN, have undergone extensive training using up to 3 trillion tokens of diverse texts and codes, encompassing a wide range of areas. These models have consistently demonstrated superior performance across a multitude of downstream tasks, even when compared to their more significantly larger counterparts.
  2. The QWEN-CHAT models have been carefully finetuned on a curated dataset relevant to task performing, chat, tool use, agent, safety, etc. The benchmark evaluation demonstrates that the SFT models can achieve superior performance. Furthermore, we have trained reward models to mimic human preference and applied them in RLHF for chat models that can produce responses preferred by humans. Through the human evaluation of a challenging test, we find that QWEN-CHAT models trained with RLHF are highly competitive, still falling behind GPT-4 on our benchmark.
  3. In addition, we present specialized models called CODE-QWEN, which includes CODE-QWEN-7B and CODE-QWEN-14B, as well as their chat models, CODE-QWEN-14B-CHAT and CODE-QWEN-7B-CHAT. Specifically, CODE-QWEN has been pre-trained on extensive datasets of code and further fine-tuned to handle conversations related to code generation, debugging, and interpretation. The results of experiments conducted on benchmark datasets, such as HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and HumanEvalPack (Muennighoff et al., 2023), demonstrate the high level of proficiency of CODE-QWEN in code understanding and generation.
  4. This research additionally introduces MATH-QWEN-CHAT specifically designed to tackle mathematical problems. Our results show that both MATH-QWEN-7B-CHAT and MATH-QWEN-14B-CHAT outperform open-sourced models in the same sizes with large margins and are approaching GPT-3.5 on math-related benchmark datasets such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021).
  5. Besides, we have open-sourced QWEN-VL and QWEN-VL-CHAT, which have the versatile ability to comprehend visual and language instructions. These models outperform the current open-source vision-language models across various evaluation benchmarks and support text recognition and visual grounding in both Chinese and English languages. Moreover, these models enable multi-image conversations and storytelling. Further details can be found in Bai et al. (2023).


Despite the potential of LLMs, the Qwen team notes that concerns have been raised about their reproducibility, steerability, and accessibility to service providers. Against this backdrop, Qwen demonstrates the following:

  • Trained extensively on up to 3 trillion tokens of diverse text and code; even the 7B model outperforms much larger open LLMs on several downstream tasks.
  • The models were carefully fine-tuned on curated datasets covering task performance, chat, tool use, agents, safety, and more; after fine-tuning, benchmark evaluation shows the SFT models achieve strong performance.
  • Reward models were trained to mimic human preference and applied in RLHF to build chat models that produce responses humans prefer; human evaluation on a challenging test shows that the RLHF-trained QWEN-CHAT models still lag GPT-4 on their benchmark but are highly competitive against other open LLMs.
  • CODE-QWEN was pretrained on extensive code datasets and further fine-tuned to handle conversations around code generation, debugging, and interpretation.
  • Experiments on benchmark datasets such as HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and HumanEvalPack (Muennighoff et al., 2023) show a high level of proficiency in code understanding and generation.
  • Both MATH-QWEN-7B-CHAT and MATH-QWEN-14B-CHAT outperform open-source models of the same size by large margins and approach GPT-3.5 on math benchmarks such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021).
  • QWEN-VL outperforms current open-source vision-language models on various evaluation benchmarks, supports text recognition and visual grounding in both Chinese and English, and enables multi-image conversation and storytelling.

Now, we officially open-source the 14B-parameter and 7B-parameter base pretrained models QWEN and aligned chat models QWEN-CHAT. This release aims at providing more comprehensive and powerful LLMs at developer- or application-friendly scales.

The structure of this report is as follows: Section 2 describes our approach to pretraining and results of QWEN. Section 3 covers our methodology for alignment and reports the results of both automatic evaluation and human evaluation. Additionally, this section describes details about our efforts in building chat models capable of tool use, code interpreter, and agent. In Sections 4 and 5, we delve into specialized models of coding and math and their performance. Section 6 provides an overview of relevant related work, and Section 7 concludes this paper and points out our future work.


2 PRETRAINING

The pretraining stage involves learning from vast amounts of data to acquire a comprehensive understanding of the world and its various complexities. This includes not only basic language capabilities but also advanced skills such as arithmetic, coding, and logical reasoning. In this section, we introduce the data, the model design and scaling, as well as the comprehensive evaluation results on benchmark datasets.

  • In the pretraining stage, the model learns from vast amounts of data to acquire a comprehensive understanding of the world; this covers not only basic language ability but also skills such as arithmetic, coding, and logical reasoning.
  • This section covers the data, the model design and scaling, and comprehensive evaluation results on benchmark datasets.

2.1 DATA

The size of data has proven to be a crucial factor in developing a robust large language model, as highlighted in previous research (Hoffmann et al., 2022; Touvron et al., 2023b). To create an effective pretraining dataset, it is essential to ensure that the data are diverse and cover a wide range of types, domains, and tasks. Our dataset is designed to meet these requirements and includes public web documents, encyclopedia, books, codes, etc. Additionally, our dataset is multilingual, with a significant portion of the data being in English and Chinese.

To ensure the quality of our pretraining data, we have developed a comprehensive data preprocessing procedure. For public web data, we extract text from HTML and use language identification tools to determine the language. To increase the diversity of our data, we employ deduplication techniques, including exact-match deduplication after normalization and fuzzy deduplication using MinHash and LSH algorithms. To filter out low-quality data, we employ a combination of rule-based and machine-learning-based methods. Specifically, we use multiple models to score the content, including language models, text-quality scoring models, and models for identifying potentially offensive or inappropriate content. We also manually sample texts from various sources and review them to ensure their quality. To further enhance the quality of our data, we selectively up-sample data from certain sources, to ensure that our models are trained on a diverse range of high-quality content. In recent studies (Zeng et al., 2022; Aribandi et al., 2021; Raffel et al., 2020), it has been demonstrated that pretraining language models with multi-task instructions can enhance their zero-shot and few-shot performance. To further enhance the performance of our model, we have incorporated high-quality instruction data into our pretraining process. To safeguard the integrity of our benchmark assessment, we have adopted a similar approach as Brown et al. (2020) and meticulously eliminated any instruction samples that exhibit a 13-gram overlap with any data present in the test sets utilized in our evaluation. Given the large number of downstream tasks, it is not feasible to repeat this filtering process for all tasks. Instead, we have made sure that the instruction data for the reported tasks have undergone our filtering process to ensure their accuracy and reliability. Finally, we have built a dataset of up to 3 trillion tokens.


  • The Qwen team built a pretraining dataset of up to 3 trillion tokens:
    • Diversity and quality: web data goes through language identification, normalization, and deduplication (exact-match plus MinHash and LSH).
    • Quality filtering: rule-based and machine-learning-based models score text quality and identify inappropriate content; selected high-quality sources are up-sampled to raise their share.
    • Instruction data: multi-task instruction data is incorporated into pretraining.
    • Benchmark integrity: instruction samples overlapping the evaluation test sets are removed.
    • Per-task filtering: instruction data for the reported tasks is passed through the filtering process to ensure accuracy and reliability.


  • Data size has proven to be a crucial factor in building a strong large language model (Hoffmann et al., 2022; Touvron et al., 2023b); an effective pretraining dataset must be diverse and cover a wide range of types, domains, and tasks.
    • The Qwen dataset is designed to meet these requirements and includes public web documents, encyclopedias, books, code, and more. It is multilingual, with a significant portion in English and Chinese, and a comprehensive preprocessing pipeline was developed to ensure data quality.
  • For public web data, text is extracted from HTML and the language is determined with language identification tools; to increase diversity, deduplication is applied, including exact-match deduplication after normalization and fuzzy deduplication using MinHash and LSH algorithms (a minimal deduplication sketch follows below).
  • To filter out low-quality data, a combination of rule-based and machine-learning-based methods is used: multiple models score the content, including language models, text-quality scoring models, and models that identify potentially offensive or inappropriate content; texts from various sources are also manually sampled and reviewed to ensure quality.
  • To further improve quality, data from certain sources is selectively up-sampled so the models train on a diverse range of high-quality content.
  • Recent studies (Zeng et al., 2022; Aribandi et al., 2021; Raffel et al., 2020) show that pretraining language models with multi-task instructions improves zero-shot and few-shot performance.
    • The Qwen team therefore incorporated high-quality instruction data into the pretraining process,
    • and, to safeguard the integrity of the benchmark assessment, adopted an approach similar to Brown et al. (2020),
    • carefully removing any instruction samples with a 13-gram overlap with data in the evaluation test sets (benchmark integrity).
  • Given the large number of downstream tasks, repeating this filtering for every task is infeasible; instead, the team made sure that the instruction data for the reported tasks went through the filtering process to ensure accuracy and reliability.
  • In total, a dataset of up to 3 trillion tokens was built.
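
The paper names MinHash and LSH for fuzzy deduplication but gives no implementation details. Below is a minimal sketch using the datasketch library; the whitespace-token shingling and the 0.8 Jaccard threshold are illustrative assumptions, not values from the paper.

from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    # MinHash signature over whitespace tokens (shingling scheme is an assumption)
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

docs = {
    "doc1": "large language models are trained on public web documents and code",
    "doc2": "large language models are trained on public web documents and code .",
    "doc3": "a completely unrelated paragraph about cooking recipes",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)   # Jaccard threshold is illustrative
kept = []
for doc_id, text in docs.items():
    sig = minhash_of(text)
    if lsh.query(sig):        # near-duplicate of a document we already kept
        continue
    lsh.insert(doc_id, sig)
    kept.append(doc_id)

print(kept)   # doc2 is dropped as a fuzzy duplicate of doc1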

2.2 TOKENIZATION

The design of vocabulary significantly impacts the training efficiency and the downstream task performance. In this study, we utilize byte pair encoding (BPE) as our tokenization method, following GPT-3.5 and GPT-4. We start with the open-source fast BPE tokenizer, tiktoken (Jain, 2022), and select the vocabulary cl100k base as our starting point. To enhance the performance of our model on multilingual downstream tasks, particularly in Chinese, we augment the vocabulary with commonly used Chinese characters and words, as well as those in other languages. Also, following Touvron et al. (2023a;b), we have split numbers into single digits. The final vocabulary size is approximately 152K.

The performance of the QWEN tokenizer in terms of compression is depicted in Figure 3. In this comparison, we have evaluated QWEN against several other tokenizers, including XLM-R (Conneau et al., 2019), LLaMA (Touvron et al., 2023a), Baichuan (Inc., 2023a), and InternLM (InternLM Team, 2023). Our findings reveal that QWEN achieves higher compression efficiency than its competitors in most languages. This implies that the cost of serving can be significantly reduced since a smaller number of tokens from QWEN can convey more information than its competitors. Furthermore, we have conducted preliminary experiments to ensure that scaling the vocabulary size of QWEN does not negatively impact the downstream performance of the pretrained model. Despite the increase in vocabulary size, our experiments have shown that QWEN maintains its performance levels in downstream evaluation.

  • Tokenizer: the final vocabulary size is approximately 152K.
    • The tokenizer has a large impact on training efficiency and downstream task performance.
    • Following GPT-3.5 and GPT-4, Qwen uses byte pair encoding (BPE); it starts from the open-source fast BPE tokenizer tiktoken (Jain, 2022) with the cl100k base vocabulary as the starting point (see the sketch below).
    • To improve performance on multilingual downstream tasks, especially Chinese, the vocabulary is augmented with commonly used Chinese characters and words as well as tokens for other languages, and, following Touvron et al. (2023a;b), numbers are split into single digits.
  • QWEN was compared against several other tokenizers, including XLM-R (Conneau et al., 2019), LLaMA (Touvron et al., 2023a), Baichuan (Inc., 2023a), and InternLM (InternLM Team, 2023), and achieves higher compression efficiency in most languages.
    • This means fewer QWEN tokens convey more information than competing tokenizers, which can significantly reduce serving costs.
    • Preliminary experiments confirmed that enlarging QWEN's vocabulary does not hurt the downstream performance of the pretrained model; despite the larger vocabulary, QWEN maintains its performance in downstream evaluation.
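
As an illustration of the starting point only, the open cl100k_base vocabulary can be loaded with tiktoken; the released Qwen tokenizer (e.g., the one shipped with Qwen/Qwen-7B on Hugging Face) further extends it with Chinese vocabulary and single-digit splitting, which this sketch does not reproduce. The chars-per-token ratio is a rough proxy for the compression comparison in Figure 3.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the open vocabulary Qwen starts from

samples = {
    "english": "Large language models compress vast knowledge into neural networks.",
    "chinese": "通义千问是阿里巴巴推出的大规模语言模型。",
    "digits":  "The answer is 123456.",
}

for name, text in samples.items():
    ids = enc.encode(text)
    # more characters per token = better compression for that language
    print(f"{name}: {len(ids)} tokens, {len(text) / len(ids):.2f} chars/token")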

2.3 ARCHITECTURE

QWEN is designed using a modified version of the Transformer architecture. Specifically, we have adopted the recent open-source approach of training large language models, LLaMA (Touvron et al., 2023a), which is widely regarded as the top open-source LLM. Our modifications to the architecture include:

  • Embedding and output projection. Based on preliminary experimental findings, we have opted for the untied embedding approach instead of tying the weights of input embedding and output projection. This decision was made in order to achieve better performance with the price of memory costs.
  • Positional embedding. We have chosen RoPE (Rotary Positional Embedding) (Su et al., 2021) as our preferred option for incorporating positional information into our model. RoPE has been widely adopted and has demonstrated success in contemporary large language models, notably PaLM (Chowdhery et al., 2022; Anil et al., 2023) and LLaMA (Touvron et al., 2023a;b). In particular, we have opted to use FP32 precision for the inverse frequency matrix, rather than BF16 or FP16, in order to prioritize model performance and achieve higher accuracy.
  • Bias. For most layers, we remove biases following Chowdhery et al. (2022), but we add biases in the QKV layer of attention to enhance the extrapolation ability of the model (Su, 2023b).
  • Pre-Norm & RMSNorm. In modern Transformer models, pre-normalization is the most widely used approach, which has been shown to improve training stability compared to post-normalization. Recent research has suggested alternative methods for better training stability, which we plan to explore in future versions of our model. Additionally, we have replaced the traditional layer normalization technique described in (Ba et al., 2016) with RMSNorm (Jiang et al., 2023). This change has resulted in equivalent performance while also improving efficiency.
  • Activation function. We have selected SwiGLU (Shazeer, 2020) as our activation function, a combination of Swish (Ramachandran et al., 2017) and Gated Linear Unit (Dauphin et al., 2017). Our initial experiments have shown that activation functions based on GLU generally outperform other baseline options, such as GeLU (Hendrycks & Gimpel, 2016). As is common practice in previous research, we have reduced the dimension of the feed-forward network (FFN) from 4 times the hidden size to 8/3 of the hidden size.
Model # of Params Hidden size Heads Layers Learning rate Batch size Training tokens
1.8B 1.8B 2048 16 24 3.0 × 10⁻⁴ 4M 2.2T
7B 7B 4096 32 32 3.0 × 10⁻⁴ 4M 2.4T
14B 14B 5120 40 40 3.0 × 10⁻⁴ 4M 3.0T


Component Choice Description
Embedding and output projection Untied embeddings The input embedding and output projection weights are not tied; separate (untied) embeddings are used.
Positional embedding RoPE (Rotary Positional Embedding) RoPE injects positional information, with FP32 precision for the inverse frequency matrix.
Bias Biases on the attention QKV layers Biases are removed from most layers but added to the attention QKV layers to improve extrapolation.
Pre-Norm & RMSNorm Layer normalization replaced with RMSNorm Pre-normalization is used, and conventional layer normalization is replaced with RMSNorm.
Activation function SwiGLU (Swish + GLU) SwiGLU is used, and the FFN dimension is reduced to 8/3 of the hidden size.
Training Flash Attention Flash Attention improves computational efficiency and reduces memory usage.
Learning rate Cosine learning rate schedule A cosine schedule with specified peak and minimum learning rates.
Optimizer AdamW AdamW is adopted as the standard optimizer; beta values and other settings are specified.
Training precision BFloat16 mixed precision BFloat16 mixed precision improves training stability.
  • Embedding and output projection.
    • Based on preliminary experiments, the weights of the input embedding and output projection are untied rather than tied, achieving better performance at the price of higher memory cost.
  • Positional embedding.
    • RoPE (Rotary Positional Embedding) (Su et al., 2021) was chosen to incorporate positional information into the model.
    • RoPE is widely adopted and has proven successful in modern large language models, notably PaLM (Chowdhery et al., 2022; Anil et al., 2023) and LLaMA (Touvron et al., 2023a;b); to prioritize model performance and accuracy, FP32 precision is used for the inverse frequency matrix rather than BF16 or FP16.
  • Bias.
    • Following Chowdhery et al. (2022), biases are removed in most layers, but biases are added to the attention QKV layers to improve the model's extrapolation ability (Su, 2023b).
  • Pre-Norm & RMSNorm.
    • Pre-normalization is the most widely used approach in modern Transformers and improves training stability compared to post-normalization. (Recent research has suggested alternative methods for better training stability, which the team plans to explore in future versions.)
    • In addition, conventional layer normalization (Ba et al., 2016) is replaced with RMSNorm (Jiang et al., 2023), which keeps performance equivalent while improving efficiency.
  • Activation function.
    • SwiGLU (Shazeer, 2020), a combination of Swish (Ramachandran et al., 2017) and the Gated Linear Unit (Dauphin et al., 2017), is used as the activation function.
    • Preliminary experiments showed that GLU-based activation functions generally outperform baselines such as GeLU (Hendrycks & Gimpel, 2016); as is common in prior work, the feed-forward network (FFN) dimension is reduced from 4x the hidden size to 8/3 of the hidden size (a minimal sketch of these components follows below).
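
A minimal PyTorch sketch of the choices above: RMSNorm, a SwiGLU FFN with the intermediate dimension at 8/3 of the hidden size, and an FP32 RoPE inverse-frequency table. Dimension rounding and other details are assumptions; the official modeling code lives in the QwenLM/Qwen-7B repository linked above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        # root-mean-square normalization, no mean subtraction (unlike LayerNorm)
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLUFFN(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        ffn_dim = int(hidden_size * 8 / 3)   # FFN reduced from 4x to 8/3x the hidden size
        self.gate_proj = nn.Linear(hidden_size, ffn_dim, bias=False)
        self.up_proj = nn.Linear(hidden_size, ffn_dim, bias=False)
        self.down_proj = nn.Linear(ffn_dim, hidden_size, bias=False)
    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

def rope_inv_freq(head_dim, base=10000.0):
    # inverse frequency matrix kept in FP32 even under BF16 training, per the paper
    return 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))

x = torch.randn(1, 8, 4096)
print(SwiGLUFFN(4096)(RMSNorm(4096)(x)).shape)   # torch.Size([1, 8, 4096])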

2.4 TRAINING

To train QWEN, we follow the standard approach of autoregressive language modeling, as described in Radford et al. (2018). This involves training the model to predict the next token based on the context provided by the previous tokens. We train models with context lengths of 2048. To create batches of data, we shuffle and merge the documents, and then truncate them to the specified context lengths. To improve computational efficiency and reduce memory usage, we employ Flash Attention in the attention modules (Dao et al., 2022). We adopt the standard optimizer AdamW (Kingma & Ba, 2014; Loshchilov & Hutter, 2017) for pretraining optimization. We set the hyperparameters β1 = 0.9, β2 = 0.95, and ϵ = 10⁻⁸. We use a cosine learning rate schedule with a specified peak learning rate for each model size. The learning rate is decayed to a minimum learning rate of 10% of the peak learning rate. All the models are trained with BFloat16 mixed precision for training stability.

  • Training follows the standard autoregressive language modeling approach described in Radford et al. (2018): the model is trained to predict the next token given the context of the preceding tokens.
  • Qwen trains with a context length of 2048; documents are shuffled and merged, then truncated to the specified context length to form batches, and Flash Attention is used in the attention modules to improve computational efficiency and reduce memory usage (Dao et al., 2022).
  • The standard AdamW optimizer (Kingma & Ba, 2014; Loshchilov & Hutter, 2017) is used for pretraining, with hyperparameters β1 = 0.9, β2 = 0.95, and ϵ = 10⁻⁸.
  • A cosine learning rate schedule with a specified peak learning rate per model size is used; the learning rate decays to a minimum of 10% of the peak, and all models are trained with BFloat16 mixed precision for stability (see the sketch below).
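
A sketch of the optimizer setup described above: AdamW with β1 = 0.9, β2 = 0.95, ϵ = 1e-8, cosine decay to 10% of the peak learning rate, and BF16 autocast. The total step count, the stand-in model, and the CPU device are placeholders, not values from the paper.

import math
import torch

model = torch.nn.Linear(4096, 4096)            # stand-in for the transformer
peak_lr, total_steps = 3.0e-4, 10_000           # total_steps is illustrative

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), eps=1e-8)

def lr_lambda(step):
    # cosine schedule decaying from the peak to 10% of the peak learning rate
    progress = min(step / total_steps, 1.0)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

x = torch.randn(8, 4096)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):   # BF16 mixed precision
    loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
print(optimizer.param_groups[0]["lr"])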

2.5 CONTEXT LENGTH EXTENSION

Transformer models have a significant limitation in terms of the context length for their attention mechanism. As the context length increases, the quadratic-complexity computation leads to a drastic increase in both computation and memory costs. In this work, we have implemented simple training-free techniques that are solely applied during inference to extend the context length of the model. One of the key techniques we have used is NTK-aware interpolation (bloc97, 2023).

  • Transformer models have a significant limitation in the context length of their attention mechanism: as the context length grows, the quadratic-complexity computation sharply increases both compute and memory costs.
  • To extend the model's context length, simple training-free techniques applied only at inference time are used; one of the key techniques is NTK-aware interpolation (bloc97, 2023), sketched below.
  • Related Reddit thread: the original NTK-aware interpolation post (bloc97, 2023).
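
A sketch of the two training-free pieces named in the results of Section 2.6: NTK-aware interpolation, which grows the RoPE base instead of compressing positions, and LogN attention scaling. The formulas follow the bloc97 (2023) post and common open implementations; the exact values Qwen uses are not spelled out in the paper.

import math

def ntk_scaled_base(head_dim, scale, base=10000.0):
    # NTK-aware interpolation: enlarge the RoPE base so low frequencies are
    # interpolated while high-frequency (local) components stay nearly intact.
    return base * scale ** (head_dim / (head_dim - 2))

def logn_scale(seq_len, train_len=2048):
    # LogN-Scaling: scale the query by log_{train_len}(seq_len) so attention
    # entropy stays stable when the inference context exceeds the training length.
    return max(1.0, math.log(seq_len) / math.log(train_len))

print(ntk_scaled_base(head_dim=128, scale=4.0))   # base for a 4x (2048 -> 8192) extension
print(logn_scale(8192))                            # ~1.18 query scaling at 8k context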

Table 2: Overall performance on widely-used benchmarks compared to open-source base models. Our largest QWEN model with 14 billion parameters outperforms previous 13B SoTA models on all datasets.

Model | Params | MMLU (5-shot) | C-Eval (5-shot) | GSM8K (8-shot) | MATH (4-shot) | HumanEval (0-shot) | MBPP (0-shot) | BBH (3-shot)
(Compared models: MPT, Falcon, ChatGLM2, InternLM, Baichuan2, LLaMA, LLAMA 2, StableBeluga2, and QWEN; see Table 2 of the paper for the per-model scores.)

2.6 EXPERIMENTAL RESULTS

To evaluate the zero-shot and few-shot learning capabilities of our models, we conduct a thorough benchmark assessment using a series of datasets. We compare QWEN with the most recent open-source base models, including LLaMA (Touvron et al., 2023a), LLAMA 2 (Touvron et al., 2023b), MPT (Mosaic ML, 2023), Falcon (Almazrouei et al., 2023), Baichuan2 (Yang et al., 2023), ChatGLM2 (ChatGLM2 Team, 2023), InternLM (InternLM Team, 2023), XVERSE (Inc., 2023b), and StableBeluga2 (Stability AI, 2023). Our evaluation covers a total of 7 popular benchmarks, which are MMLU (5-shot) (Hendrycks et al., 2020), C-Eval (5-shot) (Huang et al., 2023), GSM8K (8-shot) (Cobbe et al., 2021), MATH (4-shot) (Hendrycks et al., 2021), HumanEval (0-shot) (Chen et al., 2021), MBPP (0-shot) (Austin et al., 2021), and BBH (Big Bench Hard) (3 shot) (Suzgun et al., 2022). We aim to provide a comprehensive summary of the overall performance of our models across these benchmarks.

In this evaluation, we focus on the base language models without alignment and collect the baselines’ best scores from their official results and OpenCompass (OpenCompass Team, 2023). The results are presented in Table 2.

Our experimental results demonstrate that the three QWEN models exhibit exceptional performance across all downstream tasks. It is worth noting that even the larger models, such as LLaMA2-70B, are outperformed by QWEN-14B in 3 tasks. QWEN-7B also performs admirably, surpassing LLaMA2-13B and achieving comparable results to Baichuan2-13B. Notably, despite having a relatively small number of parameters, QWEN-1.8B is capable of competitive performance on certain tasks and even outperforms larger models in some instances. The findings highlight the impressive capabilities of the QWEN models, particularly QWEN-14B, and suggest that smaller models, such as QWEN-1.8B, can still achieve strong performance in certain applications.

To evaluate the effectiveness of context length extension, Table 3 presents the test results on arXiv (a dataset of academic papers from arxiv.org) in terms of perplexity (PPL). These results demonstrate that by combining NTK-aware interpolation, LogN-Scaling, and layer-wise window assignment, we can effectively maintain the performance of our models in the context of over 8192 tokens.

  • To evaluate the models' zero-shot and few-shot learning capabilities, a thorough benchmark assessment was conducted on a series of datasets.
    • QWEN-14B outperforms even larger models such as LLaMA2-70B on 3 tasks.
    • QWEN-7B also performs well, surpassing LLaMA2-13B on some tasks and achieving results comparable to Baichuan2-13B.
    • Notably, despite its relatively small parameter count, QWEN-1.8B delivers competitive performance on certain tasks and even outperforms larger models in some cases.
    • These findings highlight the strong capabilities of the QWEN models, especially QWEN-14B, and suggest that smaller models such as QWEN-1.8B can still perform well in certain applications.
  • To evaluate the effectiveness of context length extension, Table 3 reports test results on arXiv in terms of perplexity (PPL).
    • The results show that combining NTK-aware interpolation, LogN-Scaling, and layer-wise window assignment effectively maintains model performance at context lengths beyond 8192 tokens.
    • Reviewer note: it would be worth comparing against grouped-query attention (GQA, for faster inference) and sliding window attention (SWA),
    • i.e., SWA + GQA vs. NTK-aware interpolation + LogN-Scaling + layer-wise window assignment.


Table 3: Results of QWEN on long-context inference using various techniques. Our experimental findings reveal that the application of our crucial techniques enables the model to consistently achieve low perplexity as the context length increases. This suggests that these techniques play a significant role in enhancing the model’s ability to comprehend and generate lengthy texts.

3 ALIGNMENT

Pretrained large language models have been found to be not aligned with human behavior, making them unsuitable for serving as AI assistants in most cases. Recent research has shown that the use of alignment techniques, such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), can significantly improve the ability of language models to engage in natural conversation. In this section, we will delve into the details of how QWEN models have been trained using SFT and RLHF, and evaluate their performance in the context of chat-based assistance.

  • Pretrained large language models are found to be not aligned with human behavior, making them unsuitable as AI assistants in most cases.
    • As in prior work, the QWEN models are trained with SFT and RLHF.

3.1 SUPERVISED FINETUNING

To gain an understanding of human behavior, the initial step is to carry out SFT, which finetunes a pretrained LLM on chat-style data, including both queries and responses. In the following sections, we will delve into the details of data construction and training methods.


3.1.1 DATA

To enhance the capabilities of our supervised fine-tuning datasets, we have annotated conversations in multiple styles. While conventional datasets (Wei et al., 2022a) contain a vast amount of data prompted with questions, instructions, and answers in natural language, our approach takes it a step further by annotating human-style conversations. This practice, inspired by Ouyang et al. (2022), aims at improving the model’s helpfulness by focusing on natural language generation for diverse tasks. To ensure the model’s ability to generalize to a wide range of scenarios, we specifically excluded data formatted in prompt templates that could potentially limit its capabilities. Furthermore, we have prioritized the safety of the language model by annotating data related to safety concerns such as violence, bias, and pornography.

In addition to data quality, we have observed that the training method can significantly impact the final performance of the model. To achieve this, we utilized the ChatML-style format (OpenAI, 2022), which is a versatile meta language capable of describing both the metadata (such as roles) and the content of a turn. This format enables the model to effectively distinguish between various types of information, including system setup, user inputs, and assistant outputs, among others. By leveraging this approach, we can enhance the model’s ability to accurately process and analyze complex conversational data.


  • Conversations were annotated in multiple styles; while conventional datasets (Wei et al., 2022a) contain vast amounts of data framed as questions, instructions, and answers in natural language, Qwen's approach goes a step further by annotating human-style conversations.
    • This practice, inspired by Ouyang et al. (2022), aims to improve the model's helpfulness by focusing on natural language generation for diverse tasks.
    • To ensure the model generalizes to a wide range of scenarios, data formatted in prompt templates that could limit its capabilities was specifically excluded,
    • and data related to safety concerns such as violence, bias, and pornography was annotated to prioritize the safety of the language model.
  • Beyond data quality, the training method was found to significantly affect final model performance.
    • To this end, the ChatML-style format (OpenAI, 2022) is used, a versatile meta language that can describe both metadata (e.g., roles) and the content of a turn; this format lets the model effectively distinguish between types of information such as the system setup, user inputs, and assistant outputs.
    • This approach improves the model's ability to accurately process and analyze complex conversational data.


openai-python/public/chatml.md

[
 {"token": "<|im_start|>"},
 "system\nYou are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.\nKnowledge cutoff: 2021-09-01\nCurrent date: 2023-03-01",
 {"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
 "user\nHow are you",
 {"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
 "assistant\nI am doing well!",
 {"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
 "user\nHow are you now?",
 {"token": "<|im_end|>"}, "\n"
]

3.1.2 TRAINING

Consistent with pretraining, we also apply next-token prediction as the training task for SFT. We apply the loss masks for the system and user inputs. More details are demonstrated in Section A.1.1.

The model’s training process utilizes the AdamW optimizer, with the following hyperparameters: β1 set to 0.9, β2 set to 0.95, and ϵ set to 10−8. The sequence length is limited to 2048, and the batch size is 128. The model undergoes a total of 4000 steps, with the learning rate gradually increased over the first 1430 steps, reaching a peak of 2 × 10−6. To prevent overfitting, weight decay is applied with a value of 0.1, dropout is set to 0.1, and gradient clipping is enforced with a limit of 1.0.

Item Description
Training task Next-token prediction, consistent with pretraining
Loss mask Applied to the system and user inputs (sketch below)
Optimizer AdamW
Hyperparameters β1: 0.9, β2: 0.95, ϵ: 10⁻⁸
Sequence length Limited to 2048
Batch size 128
Total steps 4000
Learning rate schedule Gradually increased over the first 1430 steps to a peak of 2 × 10⁻⁶
Overfitting prevention Weight decay: 0.1
Dropout 0.1
Gradient clipping Limit of 1.0
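
A minimal sketch of the loss masking described above: next-token prediction in which system and user tokens are excluded from the loss by setting their labels to the ignore index. The toy shapes and the way the assistant span is marked are assumptions.

import torch
import torch.nn.functional as F

IGNORE_INDEX = -100          # ignored by F.cross_entropy
VOCAB = 152_000              # ~152K vocabulary, as in Section 2.2

def build_labels(input_ids, assistant_mask):
    # loss is computed only on assistant tokens; system/user tokens are masked out
    labels = input_ids.clone()
    labels[~assistant_mask] = IGNORE_INDEX
    return labels

input_ids = torch.randint(0, VOCAB, (1, 8))                    # toy 8-token turn
assistant_mask = torch.tensor([[False] * 5 + [True] * 3])      # last 3 tokens = assistant reply
labels = build_labels(input_ids, assistant_mask)

logits = torch.randn(1, 8, VOCAB)                              # stand-in model output
loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),      # causal shift: predict t+1 at t
                       labels[:, 1:].reshape(-1),
                       ignore_index=IGNORE_INDEX)
print(loss)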

3.2 REINFORCEMENT LEARNING FROM HUMAN FEEDBACK

While SFT has proven to be effective, we acknowledge that its generalization and creativity capabilities may be limited, and it is prone to overfitting. To address this issue, we have implemented Reinforcement Learning from Human Feedback (RLHF) to further align SFT models with human preferences, following the approaches of Ouyang et al. (2022); Christiano et al. (2017). This process involves training a reward model and using Proximal Policy Optimization (PPO) (Schulman et al., 2017) to conduct policy training.

  • While SFT has proven effective, its generalization and creativity can be limited, and it is prone to overfitting.
  • To address this, the Qwen team implemented Reinforcement Learning from Human Feedback (RLHF), following the approaches of Ouyang et al. (2022) and Christiano et al. (2017), to further align the SFT models with human preferences; this involves training a reward model and conducting policy training with Proximal Policy Optimization (PPO) (Schulman et al., 2017).

3.2.1 REWARD MODEL

To create a successful reward model, like building a large language model (LLM), it is crucial to first undergo pretraining and then fine-tuning. This pretraining process, also known as preference model pretraining (PMP) (Bai et al., 2022b), necessitates a vast dataset of comparison data. This dataset consists of sample pairs, each containing two distinct responses for a single query and their corresponding preferences. Similarly, fine-tuning is also conducted on this type of comparison data, but with a higher quality due to the presence of quality annotations.

  • As with building a large language model (LLM), creating a successful reward model requires pretraining followed by fine-tuning.
  • This pretraining process, also known as preference model pretraining (PMP) (Bai et al., 2022b), requires a vast dataset of comparison data consisting of sample pairs, each containing two distinct responses to a single query and the corresponding preference.
  • Fine-tuning is likewise performed on this type of comparison data, but of higher quality thanks to quality annotations.

Note*: DPO (Direct Preference Optimization) performs preference alignment directly on this kind of comparison dataset in a single pass, and the Zephyr research team also reported confirming this intuition; see Hugging Face co-founder Thomas Wolf's LinkedIn post.

Creating annotated prompt pair dataset During the fine-tuning phase, we gather a variety of prompts and adjust the reward model based on human feedback for responses from the QWEN models. To ensure the diversity and complexity of user prompts are properly taken into account, we have created a classification system with around 6600 detailed tags and implemented a balanced sampling algorithm that considers both diversity and complexity when selecting prompts for annotation by the reward model (Lu et al., 2023). To generate a wide range of responses, we have utilized QWEN models of different sizes and sampling strategies, as diverse responses can help reduce annotation difficulties and enhance the performance of the reward model. These responses are then evaluated by annotators following a standard annotation guideline, and comparison pairs are formed based on their scores.

  • To properly account for the diversity and complexity of user prompts, a classification system with around 6,600 detailed tags was created,
  • and a balanced sampling algorithm that considers both diversity and complexity was implemented when selecting prompts for annotation by the reward model (Lu et al., 2023).
  • To generate a wide range of responses, QWEN models of different sizes and sampling strategies were used, since diverse responses help reduce annotation difficulty and improve reward model performance.
  • Finally, annotators evaluated the responses following a standard annotation guideline, and comparison pairs were formed based on their scores.

Creating the reward model In creating the reward model, we utilize the same-sized pre-trained language model QWEN to initiate the process. It is important to mention that we have incorporated a pooling layer into the original QWEN model to extract the reward for a sentence based on a specific end token. The learning rate for this process has been set to a constant value of 3 × 10−6, and the batch size is 64. Additionally, the sequence length is set to 2048, and the training process lasts for a single epoch.

  • A pooling layer is added on top of the original QWEN model to extract the reward for a sentence from a designated end token (a minimal sketch follows below). (Learning rate (constant) = 3 × 10⁻⁶, batch size = 64, sequence length = 2048, trained for a single epoch.)
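
A sketch of the reward readout and the pairwise comparison objective implied above: a linear head pools the hidden state at a designated end token into a scalar reward, and comparison pairs are trained with a ranking loss. The exact head and loss used by the Qwen team are not detailed in the paper, so treat this as a common-practice illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Scalar reward read out at a designated end token (the 'pooling layer')."""
    def __init__(self, hidden_size):
        super().__init__()
        self.v_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states, end_positions):
        # hidden_states: (batch, seq, hidden); end_positions: (batch,) index of the end token
        pooled = hidden_states[torch.arange(hidden_states.size(0)), end_positions]
        return self.v_head(pooled).squeeze(-1)                  # (batch,) scalar rewards

def comparison_loss(r_chosen, r_rejected):
    # standard pairwise preference loss on comparison data
    return -F.logsigmoid(r_chosen - r_rejected).mean()

head = RewardHead(hidden_size=64)
h_chosen, h_rejected = torch.randn(2, 4, 10, 64)                # toy LM hidden states
ends = torch.tensor([9, 9, 9, 9])
print(comparison_loss(head(h_chosen, ends), head(h_rejected, ends)))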

We adopted the accuracy on the test dataset as an important but not exclusive evaluation metric for the reward model. In Table 4, we report the test pairwise accuracy of PMP and reward models on diverse human preference benchmark datasets (Bai et al., 2022b; Stiennon et al., 2020; Ethayarajh et al., 2022; Lightman et al., 2023). Specifically, QWEN Helpful-base and QWEN Helpful-online are our proprietary datasets. The responses in QWEN Helpful-base are generated from QWEN without RLHF, whereas QWEN Helpful-online includes responses from QWEN with RLHF. The results show that the PMP model demonstrates high generalization capabilities on out-of-distribution data, and the reward model demonstrates significant improvement on our QWEN reward datasets.

  • Accuracy on the test dataset is adopted as an important, though not exclusive, evaluation metric for the reward model; Table 4 reports the test pairwise accuracy of the PMP and reward models on diverse human preference benchmark datasets (Bai et al., 2022b; Stiennon et al., 2020; Ethayarajh et al., 2022; Lightman et al., 2023).
  • Among these, QWEN Helpful-base and QWEN Helpful-online are proprietary datasets:
    • QWEN Helpful-base: responses generated by QWEN without RLHF
    • QWEN Helpful-online: responses generated by QWEN with RLHF
  • The PMP model shows high generalization on out-of-distribution data, and the reward model shows significant improvement on the QWEN reward datasets.
  • In short*: PMP provides generalization, while the reward model makes it possible to return answers humans prefer; both PMP and RLHF are reported to be effective.

3.2.2 REINFORCEMENT LEARNING

Our Proximal Policy Optimization (PPO) process involves four models: the policy model, value model, reference model, and reward model. Before starting the PPO procedure, we pause the policy model’s updates and focus solely on updating the value model for 50 steps. This approach ensures that the value model can adapt to different reward models effectively.

During the PPO operation, we use a strategy of sampling two responses for each query simultaneously. This strategy has proven to be more effective based on our internal benchmarking evaluations. We set the KL divergence coefficient to 0.04 and normalize the reward based on the running mean. The policy and value models have learning rates of 1 × 10−6 and 5 × 10−6, respectively. To enhance training stability, we utilize value loss clipping with a clip value of 0.15. For inference, the policy top-p is set to 0.9. Our findings indicate that although the entropy is slightly lower than when top-p is set to 1.0, there is a faster increase in reward, ultimately resulting in consistently higher evaluation rewards under similar conditions.

Additionally, we have implemented a pretrained gradient to mitigate the alignment tax. Empirical findings indicate that, with this specific reward model, the KL penalty is adequately robust to counteract the alignment tax in benchmarks that are not strictly code or math in nature, such as those that test common sense knowledge and reading comprehension. It is imperative to utilize a significantly larger volume of the pretrained data in comparison to the PPO data to ensure the effectiveness of the pretrained gradient. Additionally, our empirical study suggests that an overly large value for this coefficient can considerably impede the alignment to the reward model, eventually compromising the ultimate alignment, while an overly small value would only have a marginal effect on alignment tax reduction.

  • PPO: (1) the policy model, (2) value model, (3) reference model, (4) reward model.
  • During PPO, two responses are sampled simultaneously for each query, a strategy that internal benchmarking showed to be more effective.
Parameter/Setting Value
KL Divergence Coefficient 0.04
Reward Normalization Running mean
Policy Model Learning Rate 1 x 10^(-6)
Value Model Learning Rate 5 x 10^(-6)
Value Loss Clipping Clip value of 0.15
Inference Policy Top-p 0.9
Entropy Comparison Slightly lower at 0.9, but faster reward increase leading to consistently higher evaluation rewards under similar conditions.
  • A pretrained gradient is applied to mitigate the alignment tax (a reward-shaping sketch follows below).
    • Empirical findings indicate that with this particular reward model, the KL penalty is sufficiently robust to counteract the alignment tax on benchmarks that are not strictly code or math, such as those testing common-sense knowledge and reading comprehension.
    • To make the pretrained gradient effective, a significantly larger volume of pretraining data must be used relative to the PPO data; empirically, an overly large coefficient considerably impedes alignment to the reward model and ultimately compromises the alignment, while an overly small value has only a marginal effect on reducing the alignment tax.
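
A sketch of the reward shaping this setup implies: a per-token KL penalty against the frozen reference model plus the running-mean-normalized reward-model score on the final token, using the KL coefficient of 0.04 reported above. How Qwen exactly combines these terms with the pretrained gradient is not specified, so this is an assumption-level illustration.

import torch

def shaped_rewards(rm_scores, logp_policy, logp_ref, kl_coef=0.04, running_mean=0.0):
    # per-token KL penalty toward the reference model (alignment-tax control)
    kl = logp_policy - logp_ref                      # (batch, seq)
    rewards = -kl_coef * kl
    # reward-model score, normalized by its running mean, added on the last token
    rewards[:, -1] += rm_scores - running_mean
    return rewards

logp_policy, logp_ref = torch.randn(2, 2, 6)          # toy per-token log-probs, batch of 2
rm_scores = torch.tensor([0.7, -0.2])                  # scalar rewards for the 2 samples
print(shaped_rewards(rm_scores, logp_policy, logp_ref).shape)   # torch.Size([2, 6])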

3.3 AUTOMATIC AND HUMAN EVALUATION OF ALIGNED MODELS

To showcase the effectiveness of our aligned models, we conduct a comparison with other aligned models on well-established benchmarks, including MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021), and BBH (Suzgun et al., 2022). Besides the widely used few-shot setting, we test our aligned models in the zero-shot setting to demonstrate how well the models follow instructions. The prompt in a zero-shot setting consists of an instruction and a question without any previous examples in the context. The results of the baselines are collected from their official reports and OpenCompass (OpenCompass Team, 2023).

The results in Table 5 demonstrate the effectiveness of our aligned models in understanding human instructions and generating appropriate responses. QWEN-14B-Chat outperforms all other models except ChatGPT (OpenAI, 2022) and LLAMA 2-CHAT-70B (Touvron et al., 2023b) in all datasets, including MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021), and BBH (Suzgun et al., 2022). In particular, QWEN’s performance in HumanEval, which measures the quality of generated codes, is significantly higher than that of other open-source models.

Moreover, QWEN’s performance is consistently better than that of open-source models of similar size, such as LLaMA2 (Touvron et al., 2023b), ChatGLM2 (ChatGLM2 Team, 2023), InternLM (InternLM Team, 2023), and Baichuan2 (Yang et al., 2023). This suggests that our alignment approach, which involves fine-tuning the model on a large dataset of human conversations, has been effective in improving the model’s ability to understand and generate human-like language.

Despite this, we have reservations about the ability of traditional benchmark evaluation to accurately measure the performance and potential of chat models trained with alignment techniques in today’s landscape. The results mentioned earlier provide some evidence of our competitive standing, but we believe that it is crucial to develop new evaluation methods specifically tailored to aligned models.

We believe that human evaluation is crucial, which is why we have created a carefully curated dataset for this purpose. Our process involved collecting 300 instructions in Chinese that covered a wide range of topics, including knowledge, language understanding, creative writing, coding, and mathematics. To evaluate the performance of different models, we chose the SFT version of QWEN-CHAT-7B and the SFT and RLHF versions of QWEN-CHAT-14B, and added two strong baselines, GPT-3.5 and GPT-4, for comparison. For each instruction, we asked three annotators to rank the model responses by the overall score of helpfulness, informativeness, validity, and other relevant factors. Our dataset and evaluation methodology provide a comprehensive and rigorous assessment of the capabilities of different language models in various domains.

Figure 4 illustrates the win rates of the various models. For each model, we report the percentage of wins, ties, and losses against GPT-3.5, with the segments of each bar from bottom to top representing these statistics. The experimental results clearly demonstrate that the RLHF model outperforms the SFT models by significant margins, indicating that RLHF can encourage the model to generate responses that are more preferred by humans. In terms of overall performance, we find that the RLHF model significantly outperforms the SFT models, falling behind GPT-4. This indicates the effectiveness of RLHF for aligning to human preference. To provide a more comprehensive understanding of the models’ performance, we include a case study with examples from different models in Appendix A.2.2. Nonetheless, it remains difficult to accurately capture the gap between our models and the proprietary models. As such, a more extensive and rigorous assessment is required for the chat models. To obtain the results from the models, we use the OpenAI APIs of GPT-3.5-turbo-0613 and GPT-4-0613.

  • Well-known benchmarks such as MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021), and BBH (Suzgun et al., 2022) are used; besides the usual few-shot setting, zero-shot tests were also run to check how well the models follow instructions, and baseline results were collected from official reports and OpenCompass.
  • QWEN-14B-Chat outperforms all other models except ChatGPT (OpenAI, 2022) and LLAMA 2-CHAT-70B (Touvron et al., 2023b) on all datasets, including MMLU, C-Eval, GSM8K, HumanEval, and BBH.
  • The Qwen team is also skeptical that traditional benchmark evaluation can accurately measure the performance and potential of chat models trained with alignment techniques in today's landscape; while the results above provide some evidence of competitiveness, they believe it is crucial to develop new evaluation methods tailored to aligned models, and that qualitative human evaluation is essential.
    • For this, 300 Chinese instructions covering a wide range of topics such as knowledge, language understanding, creative writing, coding, and mathematics were collected; to evaluate the different models, the SFT version of QWEN-CHAT-7B and the SFT and RLHF versions of QWEN-CHAT-14B were chosen, with two strong baselines, GPT-3.5 and GPT-4, added for comparison.
    • For each instruction, three annotators were asked to rank the model responses by an overall score of helpfulness, informativeness, validity, and other relevant factors.
    • The dataset and evaluation methodology provide a comprehensive and rigorous assessment of different language models across various domains.
      • The experimental results clearly show that the RLHF model outperforms the SFT models by significant margins, indicating that RLHF encourages the model to generate responses that humans prefer.
      • In terms of overall performance, the RLHF model significantly outperforms the SFT models but still falls behind GPT-4.
      • This indicates that RLHF is effective for aligning to human preference; for a more comprehensive understanding of model performance, a case study with examples from different models is included in Appendix A.2.2.
      • Nonetheless, it remains difficult to accurately capture the gap between the Qwen models and proprietary models (ChatGPT/GPT-4), and a more extensive and rigorous evaluation of chat models is still needed.

3.4 TOOL USE, CODE INTERPRETER, AND AGENT

The QWEN models, which are designed to be versatile, have the remarkable ability to assist with (semi-)automating daily tasks by leveraging their skills in tool-use and planning. As such, they can serve as agents or copilots to help streamline various tasks. We explore QWEN’s proficiency in the following areas:

  • Utilizing unseen tools through ReAct prompting (Yao et al., 2022) (see Table 6).
  • Using a Python code interpreter to enhance math reasoning, data analysis, and more (see Table 7 and Table 8).
  • Functioning as an agent that accesses Hugging Face’s extensive collection of multimodal models while engaging with humans (see Table 9).

To enhance QWEN’s capabilities as an agent or copilot, we employ the self-instruct (Wang et al., 2023c) strategy for SFT. Specifically, we utilize the in-context learning capability of QWEN for self-instruction. By providing a few examples, we can prompt QWEN to generate more relevant queries and generate outputs that follow a specific format, such as ReAct (Yao et al., 2022). We then apply rules and involve human annotators to filter out any noisy samples. Afterwards, the samples are incorporated into QWEN’s training data, resulting in an updated version of QWEN that is more dependable for self-instruction. We iterate through this process multiple times until we gather an ample number of samples that possess both exceptional quality and a wide range of diversity. As a result, our final collection consists of around 2000 high-quality samples.

During the fine-tuning process, we mix these high-quality samples with all the other general-purpose SFT samples, rather than introducing an additional training stage. By doing so, we are able to retain essential general-purpose capabilities that are also pertinent for constructing agent applications.

Using Tools via ReAct Prompting We have created and made publicly available a benchmark for evaluating QWEN’s ability to call plugins, tools, functions, or APIs using ReAct Prompting (see Qwen Team, Alibaba Group, 2023b). To ensure fair evaluation, we have excluded any plugins that were included in QWEN’s training set from the evaluation set. The benchmark assesses the model’s accuracy in selecting the correct plugin from a pool of up to five candidates, as well as the plausibility of the parameters passed into the plugin and the frequency of false positives. In this evaluation, a false positive occurs when the model incorrectly invokes a plugin in response to a query, despite not being required to do so.

The results presented in Table 6 demonstrate that QWEN consistently achieves higher accuracy in identifying the relevance of a query to the available tools as the model size increases. However, the table also highlights that beyond a certain point, there is little improvement in performance when it comes to selecting the appropriate tool and providing relevant arguments. This suggests that the current preliminary benchmark may be relatively easy and may require further enhancement in future iterations. It is worth noting that GPT-3.5 stands out as an exception, displaying suboptimal performance on this particular benchmark. This could potentially be attributed to the fact that the benchmark primarily focuses on the Chinese language, which may not align well with GPT-3.5’s capabilities. Additionally, we observe that GPT-3.5 tends to attempt to use at least one tool, even if the query cannot be effectively addressed by the provided tools.
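
For reference, the generic ReAct format (Yao et al., 2022) that this kind of tool-use prompting builds on looks roughly like the template below; the exact prompt and plugin schema used in the Qwen team's public benchmark differ, so the tool names and wording here are placeholders.

TOOLS = """search: searches the web for up-to-date information. Input: a query string.
calculator: evaluates an arithmetic expression. Input: the expression."""

REACT_TEMPLATE = f"""Answer the following question. You have access to these tools:

{TOOLS}

Use the following format:
Question: the input question
Thought: reason about what to do next
Action: the tool to use, one of [search, calculator]
Action Input: the input to the tool
Observation: the result of the tool call
... (Thought/Action/Action Input/Observation can repeat)
Thought: I now know the final answer
Final Answer: the answer to the original question

Question: {{question}}"""

print(REACT_TEMPLATE.format(question="How many days are in 3 weeks plus 5 days?"))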

Using Code Interpreter for Math Reasoning and Data Analysis The Python code interpreter is widely regarded as a powerful tool for augmenting the capabilities of an LLM agent. It is worth investigating whether QWEN can harness the full potential of this interpreter to enhance its performance in diverse domains, such as mathematical reasoning and data analysis. To facilitate this exploration, we have developed and made publicly available a benchmark that is specifically tailored for this purpose (see Qwen Team, Alibaba Group, 2023a).

The benchmark encompasses three primary categories of tasks: math problem-solving, data visualization, and other general-purpose tasks like file post-processing and web crawling. Within the visualization tasks, we differentiate between two levels of difficulty. The easier level can be achieved by simply writing and executing a single code snippet without the need for advanced planning skills. However, the more challenging level requires strategic planning and executing multiple code snippets in a sequential manner. This is because the subsequent code must be written based on the output of the previous code. For example, an agent may need to examine the structure of a CSV file using one code snippet before proceeding to write and execute additional code to create a plot.

Regarding evaluation metrics, we consider both the executability and correctness of the generated code. To elaborate on the correctness metrics, for math problems, we measure accuracy by verifying if the ground truth numerical answer is present in both the code execution result and the final response. When it comes to data visualization, we assess accuracy by utilizing QWEN-VL (Bai et al., 2023), a powerful multimodal language model. QWEN-VL is capable of answering text questions paired with images, and we rely on it to confirm whether the image generated by the code fulfills the user’s request.
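
A toy version of the math correctness check described above (the answer must appear both in the interpreter's output and in the final response); the real benchmark's matching rules are presumably more careful about number formats, so this is only illustrative.

import re

def math_answer_correct(ground_truth, execution_output, final_response):
    # correct only if the ground-truth number appears in BOTH the code execution
    # result and the model's final natural-language answer
    pattern = re.escape(str(ground_truth))
    return bool(re.search(pattern, execution_output)) and \
           bool(re.search(pattern, final_response))

print(math_answer_correct(42, "print(6 * 7)  # -> 42", "The final answer is 42."))  # True
print(math_answer_correct(42, "result: 41", "The final answer is 42."))             # False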

The results regarding executability and correctness are presented in Table 7 and Table 8, respectively. It is evident that CODE LLAMA generally outperforms LLAMA 2, its generalist counterpart, which is not surprising since this benchmark specifically requires coding skills. However, it is worth noting that specialist models that are optimized for code synthesis do not necessarily outperform generalist models. This is due to the fact that this benchmark encompasses various skills beyond coding, such as abstracting math problems into equations, understanding language-specified constraints, and responding in the specified format such as ReAct. Notably, QWEN-7B-CHAT and QWEN-14B-CHAT surpass all other open-source alternatives of similar scale significantly, despite being generalist models.

Serving as a Hugging Face Agent Hugging Face provides a framework called the Hugging Face Agent or Transformers Agent (Hugging Face, 2023), which empowers LLM agents with a curated set of multimodal tools, including speech recognition and image synthesis. This framework allows an LLM agent to interact with humans, interpret natural language commands, and employ the provided tools as needed.

To evaluate QWEN’s effectiveness as a Hugging Face agent, we utilized the evaluation benchmarks offered by Hugging Face. The results are presented in Table 9. The evaluation results reveal that QWEN performs quite well in comparison to other open-source alternatives, only slightly behind the proprietary GPT-4, demonstrating QWEN’s competitive capabilities.

4 CODE-QWEN: SPECIALIZED MODEL FOR CODING

Training on domain-specific data has been shown to be highly effective, particularly in the case of code pretraining and fine-tuning. A language model that has been reinforced with training on code data can serve as a valuable tool for coding, debugging, and interpretation, among other tasks. In this work, we have developed a series of generalist models using pretraining and alignment techniques. Building on this foundation, we have created domain-specific models for coding by leveraging the base language models of QWEN, including the continued-pretrained model, CODE-QWEN, and the supervised-finetuned model, CODE-QWEN-CHAT. Both models come in 14-billion- and 7-billion-parameter versions.

4.1 CODE PRETRAINING

We believe that relying solely on code data for pretraining can result in a significant loss of the ability to function as a versatile assistant. Unlike previous approaches that focused solely on pretraining on code data (Li et al., 2022; 2023d), we take a different approach (Rozière et al., 2023) by starting with our base models QWEN trained on a combination of text and code data, and then continuing to pretrain on the code data. We continue to pretrain the models on a total of around 90 billion tokens. During the pre-training phase, we initialize the model using the base language models QWEN. Many applications that rely on specialized models for coding may encounter lengthy contextual scenarios, such as tool usage and code interpretation, as mentioned in Section 3.4. To address this issue, we train our models with context lengths of up to 8192. Similar to base model training in Section 2.4, we employ Flash Attention (Dao et al., 2022) in the attention modules, and adopt the standard optimizer AdamW (Kingma & Ba, 2014; Loshchilov & Hutter, 2017), setting β1 = 0.9, β2 = 0.95, and ϵ = 10−8. We set the learning rate as 6.0 × 10−5 for CODE-QWEN-14B and 3.0 × 10−5 for CODE-QWEN-7B, with 3% warm up iterations and no learning rate decays.

4.2 CODE SUPERVISED FINE-TUNING

After conducting a series of empirical experiments, we have determined that a multi-stage SFT strategy yields the best performance compared to other methods. In the supervised fine-tuning stage, the model CODE-QWEN-CHAT, initialized from the code foundation model CODE-QWEN, is optimized by the AdamW optimizer (Kingma & Ba, 2014; Loshchilov & Hutter, 2017) (β1 = 0.9, β2 = 0.95, ε = 10⁻⁸) with a learning rate of 2.0 × 10⁻⁶ and 1.0 × 10⁻⁵ for the 14B and 7B models, respectively. The learning rate increases to its peak value with a cosine learning rate schedule (3% warm-up steps) and then remains constant.
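The schedule described here differs from the code-pretraining one only in the warm-up shape: the learning rate rises to its peak along a cosine curve and then stays flat. A small sketch of that multiplier function, with placeholder step counts that are not the authors' exact values:

```python
import math

# Hedged sketch of the SFT schedule above: cosine ramp over the 3% warm-up
# steps, then constant at the peak learning rate.
def lr_multiplier(step: int, warmup_steps: int) -> float:
    if step >= warmup_steps:
        return 1.0  # hold at the peak learning rate after warm-up
    return 0.5 * (1.0 - math.cos(math.pi * step / warmup_steps))

warmup_steps = 300  # assumed: 3% of a hypothetical 10,000-step run
for s in (0, 150, 300, 5000):
    print(s, round(lr_multiplier(s, warmup_steps), 3))  # 0.0, 0.5, 1.0, 1.0
```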

4.3 EVALUATION

Our CODE-QWEN models have been compared with both proprietary and open-source language models, as shown in Tables 10 and 11, which present the results of our evaluation on the test sets of HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and the multilingual code generation benchmark HUMANEVALPACK (Muennighoff et al., 2023). The comparison is based on the models' pass@1 performance on these benchmark datasets.
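For reference, pass@1 is typically reported with the unbiased pass@k estimator of Chen et al. (2021), which averages, over problems, the probability that at least one of k drawn samples passes the unit tests. A small sketch of that estimator:

```python
from math import comb

# Unbiased pass@k estimator (Chen et al., 2021); the benchmarks above use k = 1.
def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples for one problem, 5 of them pass -> pass@1 = 5/20 = 0.25
print(pass_at_k(n=20, c=5, k=1))
```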

Our analysis reveals that specialized models, specifically CODE-QWEN and CODE-QWEN-CHAT, significantly outperform previous baselines with similar parameter counts, such as OCTOGEEX (Muennighoff et al., 2023), InstructCodeT5+ (Wang et al., 2023d), and CodeGeeX2 (Zheng et al., 2023). In fact, these models even rival the performance of larger models like Starcoder (Li et al., 2023d).

When compared to some of the extremely large-scale closed-source models, CODE-QWEN and CODE-QWEN-CHAT demonstrate clear advantages in terms of pass@1. However, it is important to note that these models fall behind the state-of-the-art methods, such as GPT-4, in general. Nonetheless, with the continued scaling of both model size and data size, we believe that this gap can be narrowed in the near future.

It is crucial to emphasize that the evaluations mentioned previously are insufficient for grasping the full extent of the strengths and weaknesses of the models. In our opinion, it is necessary to develop more rigorous tests to enable us to accurately assess our relative performance in comparison to GPT-4.

5 MATH-QWEN: SPECIALIZED MODEL FOR MATHEMATICS REASONING

We have created a mathematics-specialized model series called MATH-QWEN-CHAT, which is built on top of the QWEN pretrained language models. Specifically, we have developed assistant models that are specifically designed to excel in arithmetic and mathematics and are aligned with human behavior. We are releasing two versions of this model series, MATH-QWEN-14B-CHAT and MATH-QWEN-7B-CHAT, which have 14 billion and 7 billion parameters, respectively.

5.1 TRAINING

We carry out math SFT on our augmented math instructional dataset for mathematical reasoning, and thus obtain the chat model MATH-QWEN-CHAT directly. Owing to the shorter average length of the math SFT data, we use a sequence length of 1024 for faster training. Most user inputs in the math SFT dataset are examination questions: it is easy for the model to predict the input format, and it is meaningless for it to predict the input conditions and numbers, which could be random. Thus, we mask the system and user inputs to avoid computing loss on them, and in our preliminary experiments we found that masking them accelerates convergence. For optimization, we use the AdamW optimizer with the same hyperparameters as in SFT, except for a peak learning rate of 2 × 10⁻⁵ and 50,000 training steps.
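Masking the prompt out of the loss is usually done by setting the corresponding label positions to the ignore index of the cross-entropy loss. A minimal sketch, assuming the PyTorch convention of -100 as the ignore index and a placeholder `prompt_len` for the tokenized length of the system/user prefix:

```python
import torch

IGNORE_INDEX = -100  # PyTorch cross-entropy ignores these positions

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Copy the token ids as labels, but exclude the prompt from the loss."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX  # no loss on system/user tokens
    return labels

ids = torch.tensor([11, 12, 13, 14, 15, 16])   # toy token ids
print(build_labels(ids, prompt_len=3))          # tensor([-100, -100, -100, 14, 15, 16])
```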

5.2 EVALUATION

We evaluate models on the test sets of GSM8K (grade-school math) (Cobbe et al., 2021), MATH (challenging competition math problems) (Hendrycks et al., 2021), Math401 (arithmetic ability) (Yuan et al., 2023b), and Math23K (Chinese grade-school math) (Wang et al., 2017). We compare MATH-QWEN-CHAT with the proprietary models ChatGPT and Minerva (Lewkowycz et al., 2022) and the open-source math-specialized models RFT (Yuan et al., 2023a), WizardMath (Luo et al., 2023a), and GAIRMath-Abel (Chern et al., 2023a) in Table 12. MATH-QWEN-CHAT models show better math reasoning and arithmetic abilities than open-source models and QWEN-CHAT models of similar sizes. Compared to proprietary models, MATH-QWEN-7B-CHAT outperforms Minerva-8B on MATH, while MATH-QWEN-14B-CHAT approaches Minerva-62B and GPT-3.5 on GSM8K and MATH and delivers better performance on arithmetic ability and Chinese math problems.

6 RELATED WORK

6.1 LARGE LANGUAGE MODELS

Excitement around LLMs began with the introduction of the Transformer architecture (Vaswani et al., 2017), which researchers such as Radford et al. (2018), Devlin et al. (2018), and Liu et al. (2019) then applied to pretraining on large-scale data. These efforts led to significant success in transfer learning, with model sizes growing from 100 million to over 10 billion parameters (Raffel et al., 2020; Shoeybi et al., 2019).

In 2020, the release of GPT-3, a massive language model that is 10 times larger than T5, demonstrated the incredible potential of few-shot and zero-shot learning through prompt engineering and in-context learning, and later chain-of-thought prompting (Wei et al., 2022c). This success has led to a number of studies exploring the possibilities of further scaling these models (Scao et al., 2022; Zhang et al., 2022; Du et al., 2021; Zeng et al., 2022; Lepikhin et al., 2020; Fedus et al., 2022; Du et al., 2022; Black et al., 2022; Rae et al., 2021; Hoffmann et al., 2022; Chowdhery et al., 2022; Thoppilan et al., 2022). As a result, the community has come to view these large language models as essential foundations for downstream models (Bommasani et al., 2021).

The birth of ChatGPT (OpenAI, 2022) and the subsequent launch of GPT-4 (OpenAI, 2023) marked two historic moments in the field of artificial intelligence, demonstrating that large language models (LLMs) can serve as effective AI assistants capable of communicating with humans. These events have sparked interest among researchers and developers in building language models that are aligned with human values and potentially even capable of achieving artificial general intelligence (AGI) (Anil et al., 2023; Anthropic, 2023a;b).

One notable development in this area is the emergence of open-source LLMs, specifically LLaMA (Touvron et al., 2023a) and LLAMA 2 (Touvron et al., 2023b), which have been recognized as the most powerful open-source language models ever created. This has led to a surge of activity in the open-source community (Wolf et al., 2019), with a series of large language models being developed collaboratively to build upon this progress (Mosaic ML, 2023; Almazrouei et al., 2023; ChatGLM2 Team, 2023; Yang et al., 2023; InternLM Team, 2023).

6.2 ALIGNMENT

The community was impressed by the surprising effectiveness of alignment on LLMs. Previously, LLMs without alignment often struggled with issues such as repetitive generation, hallucination, and deviation from human preferences. Since 2021, researchers have been diligently working on developing methods to enhance the performance of LLMs in downstream tasks (Wei et al., 2022a; Sanh et al., 2021; Longpre et al., 2023; Chung et al., 2022; Muennighoff et al., 2022). Furthermore, researchers have been actively exploring ways to align LLMs with human instructions (Ouyang et al., 2022; Askell et al., 2021; Bai et al., 2022b;c). One major challenge in alignment research is the difficulty of collecting data. While OpenAI has utilized its platform to gather human prompts or instructions, collecting such data at scale is not feasible for others.

However, there has been some progress in this area, such as the self-instruct approach proposed in Wang et al. (2023c). This innovative work offers a potential solution to the data collection problem in alignment research. As a result, there has been a surge in open-source chat data, including Alpaca (Taori et al., 2023), MOSS (Sun et al., 2023a), Dolly (Conover et al., 2023), Evol-Instruct (Xu et al., 2023b), and others (Sun et al., 2023b; Xu et al., 2023a;c; Chen et al., 2023c; Ding et al., 2023; Ji et al., 2023; Yang, 2023). Similarly, there has been an increase in open-source chat models, such as Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), Guanaco (Dettmers et al., 2023), MOSS (Sun et al., 2023a), WizardLM (Xu et al., 2023b), and others (Xu et al., 2023c; Chen et al., 2023c; Ding et al., 2023; Wang et al., 2023b).

To train an effective chat model, available solutions are mostly based on SFT and RLHF (Ouyang et al., 2022). While SFT is similar to pretraining, it focuses on instruction following using the aforementioned data. However, for many developers, limited memory capacity is a major obstacle to further research in SFT. As a result, parameter-efficient tuning methods such as LoRA (Hu et al., 2021) and Q-LoRA (Dettmers et al., 2023) have gained popularity in the community. LoRA tunes only low-rank adapters, while Q-LoRA builds on LoRA and additionally quantizes the frozen base model to 4 bits, further reducing memory usage.
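As a concrete illustration of parameter-efficient tuning, here is a hedged sketch using the peft library. The model name, rank, and `target_modules` entry are illustrative assumptions rather than the authors' configuration; a Q-LoRA variant would additionally load the base model in 4-bit precision.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; trust_remote_code is required for the Qwen architecture.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank adapters
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # assumed attention-projection module name
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```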


Qwen-7B GitHub Repository

We open-source Qwen-7B and Qwen-7B-Chat on both ModelScope and Hugging Face. This repo includes a brief introduction to Qwen-7B, usage guidance, and a link to a technical memo that provides more information.

Qwen-7B is the 7B-parameter version of the large language model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a large volume of data, including web text, books, code, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, an AI assistant trained with alignment techniques. The features of the Qwen-7B series include:

  1. Trained with high-quality pretraining data. We have pretrained Qwen-7B on a self-constructed, large-scale, high-quality dataset of over 2.2 trillion tokens. The dataset includes plain text and code and covers a wide range of domains, spanning both general and professional data.
  2. Strong performance. In comparison with models of a similar size, Qwen-7B outperforms the competitors on a series of benchmark datasets that evaluate natural language understanding, mathematics, coding, etc.
  3. Better support of languages. Our tokenizer, based on a large vocabulary of over 150K tokens, is more efficient than other tokenizers. It is friendly to many languages and helps users further finetune Qwen-7B to extend its understanding of a particular language.
  4. Support of 8K Context Length. Both Qwen-7B and Qwen-7B-Chat support a context length of 8K, which allows inputs with long contexts.
  5. Support of Plugins. Qwen-7B-Chat is trained with plugin-related alignment data, and is thus capable of using tools, including APIs, models, databases, etc., and of acting as an agent.
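As a quick orientation before the sections below, here is a minimal usage sketch along the lines of the repo's quickstart. It assumes the checkpoints published under the Qwen organization on Hugging Face and the custom `chat` interface exposed via `trust_remote_code`; consult the repo for the exact, up-to-date snippet.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Both calls need trust_remote_code=True because Qwen ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True
).eval()

# The chat helper keeps the dialogue history across turns.
response, history = model.chat(tokenizer, "Hello! What can you do?", history=None)
print(response)
```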

The following sections include information that you might find helpful. In particular, we advise you to read the FAQ section before opening issues.

Performance

In general, Qwen-7B outperforms baseline models of a similar size, and even larger models of around 13B parameters, on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, HumanEval, WMT22 (en-zh), and CMMLU, which evaluate the models' capabilities in natural language understanding, mathematical problem solving, coding, etc. See the results below.

| Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | CMMLU |
|---|---|---|---|---|---|---|
| LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | - |
| LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | - |
| Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | 44.4 |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | 48.8 |
| InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | - |
| Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | 55.8 |
| LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | - |
| LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | - |
| ChatGLM2-12B | 56.2 | 61.6 | 40.9 | - | - | - |
| Qwen-7B | 56.7 | 59.6 | 51.6 | 24.4 | 30.6 | 58.8 |
