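The Phi model series focuses on strengthening capabilities in specific domains and on optimizing the training process by revising the data strategy. Through each model's distinctive data strategy and parameter efficiency, the Phi series performs well on a wide range of tasks, large and small, which broadens the models' scalability and scope of application and makes practical deployment feasible.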
Phi index for research on data quality and domain-specific SLMs
Release Date: 2023.06
Release Date: 2023.09
Release Date: 2023.12
Release Date: 2024.04
Contents
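1. Introduction
In recent years, Large Language Models (LLMs) have transformed the field of natural language processing and are shifting the paradigm of human-computer interaction. In particular, the latest models such as GPT-4 show marked improvements over their predecessors and deliver capabilities that were thought to be unattainable in the short term. These advances have economic implications and may also redefine the conceptual frameworks of artificial intelligence and of cognition itself.
This work continues the investigation into the basic question of "how small can an LLM be," focusing on the more difficult task of common-sense reasoning. The authors build phi-1.5, a 1.3 billion parameter model trained on a dataset of 30 billion tokens, and obtain results competitive with much larger models.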
2. Method
2.1 Architecture
The phi-1.5 model uses the same Transformer architecture as its predecessor phi-1: 24 layers, 32 heads per layer, and a head dimension of 64. It uses rotary embedding with rotary dimension 32 and a context length of 2048. It also uses flash-attention to speed up training.
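2.2 Training dataset
The training dataset for phi-1.5 combines phi-1's training data (7 billion tokens) with newly created synthetic "textbook-like" data (roughly 20 billion tokens). The new data is intended to teach common-sense reasoning and general knowledge, and samples from web datasets are used in the generation prompts for diversity. In particular, 20,000 topics were selected to seed the generation of the new synthetic data.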
2.3 Training details
phi-1.5 is trained from random initialization with a constant learning rate of 2e−4 and weight decay 0.1. The optimizer is Adam with momentum 0.9, 0.98 and epsilon 1e−7. Training uses fp16 with DeepSpeed ZeRO Stage 2. The batch size is 2048, and training runs for 150 billion tokens, 80% of which come from the newly created synthetic data.
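3. Benchmark results
phi-1.5 is evaluated on standard natural language benchmarks covering common-sense reasoning, language understanding, mathematics, and coding. On common-sense reasoning benchmarks such as WinoGrande, it achieves results comparable to Llama2-7B, Falcon-7B, and Vicuna-13B, which is competitive for its size. On coding tasks, it outperforms all existing models, including Llama65B.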
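4. Conclusion
The development and training of phi-1.5 aimed at making LLMs more efficient and economical. In particular, training on synthetic data may help reduce toxicity and bias. The work shows that even a smaller model can reach a high level of performance.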
Over the past few years, Large Language Models (LLMs) have transformed the field of Natural Language Processing. More broadly, they hold the promise of a paradigm shift for human-computer interaction. These advancements have far-reaching economic implications, as well as the potential to redefine our conceptual frameworks of artificial intelligence and perhaps even cognition itself. Moreover, the latest generation of models such as GPT-4 [Ope23] have demonstrated remarkable improvements over their predecessors, offering capabilities previously thought to be unattainable in the short term; see for example [BCE+23] for an in-depth comparison between GPT-4 and its predecessor GPT-3.5.
The improvement from one generation of LLMs to the next seems at the moment to primarily stem from scale, with the most powerful models nearing trillions of parameters and trillions of tokens for training data (for example, PaLM [CND+22] has 540 billion parameters and was trained on 780 billion tokens). A natural question arises: Is this large scale indispensable for achieving high levels of capability? Far from being merely an academic question, answering this holds implications across several dimensions. Economically, the cost of training, deploying, and maintaining such large models can be substantial. Scientifically, understanding whether similar capabilities can be achieved at a smaller scale could provide insights into the architectures and development of intelligent systems. From a responsible AI standpoint, the energy consumption of large-scale models is becoming an increasing concern, as is the question of how controllable or governable these large models can be. Finally, the ability to train compact models with cutting-edge capabilities would democratize advanced AI, enabling a broader range of individuals and organizations to study and deploy them, instead of being an exclusive domain of a few with vast computational resources.
In this work, we continue the investigation into the fundamental question of “how small can a LLM be to achieve certain capabilities.” The prior work [EL23] considered this question for the task of “speaking fluent English,” while the subsequent work [GZA+23] considered the more challenging task of coding simple functions in Python. Here we focus on the more elusive concept of common-sense reasoning, a notoriously challenging task for AI [SBBC21]. Our results are summarized in Figure 1. In a nutshell, we build phi-1.5, a 1.3 billion parameter model trained on a dataset of 30 billion tokens, which achieves common-sense reasoning benchmark results comparable to models ten times its size that were trained on datasets more than ten times larger. Moreover, our dataset consists almost exclusively of synthetically generated data (closely following the approach from [GZA+23], see next section for more details), which has important implications for the potential to control for the notoriously challenging issue of toxic and biased content generation with LLMs [BGMMS21]. Additionally, we discuss the performance of a related filtered web data enhanced version of phi-1.5, which we call phi-1.5-web.
We open-source our raw phi-1.5 model (without instruction fine-tuning or any other stage of alignment) to empower the research community in its work on some of the most urgent questions around LLMs: in-context learning, mechanistic interpretability, and mitigation strategies for hallucinations, toxic content generation, and biased outputs. Indeed, phi-1.5 is the first LLM at the one billion parameters scale to exhibit most of the relevant traits of larger LLMs for research on these topics. We hope that phi-1.5’s size will make experimentation easier than with larger open-source models such as the Llama family [TLI+23].
We give here details of the creation of phi-1.5. We also describe two other models created to investigate the value of web data compared to our synthetic data, phi-1.5-web-only and phi-1.5-web.
The architecture for phi-1.5 (and its variants) is exactly the same as our previous model phi-1 in [GZA+23]. It is a Transformer [VSP+17] with 24 layers, 32 heads, and each head has dimension 64. We use rotary embedding with rotary dimension 32, and context length 2048. We also use flash-attention [DFE+22, Dao23] for training speedup, and we use the tokenizer of codegen-mono [NPH+22].
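As a rough sanity check on the stated dimensions, the following sketch derives the hidden size from the head count and head dimension and estimates the parameter count. The vocabulary size, the 4x MLP expansion, and the decision to ignore biases, layer norms, and any untied output head are assumptions made for illustration; they are not stated in the text above. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phi15Config:
    # Values taken from the architecture description.
    n_layer: int = 24
    n_head: int = 32
    head_dim: int = 64
    rotary_dim: int = 32
    context_length: int = 2048
    # ASSUMPTION: vocabulary size is not given in the text; 51200 is illustrative.
    vocab_size: int = 51200

    @property
    def hidden_size(self) -> int:
        # Model width is the number of heads times the per-head dimension.
        return self.n_head * self.head_dim


def estimate_parameters(cfg: Phi15Config) -> int:
    """Back-of-the-envelope count: attention (4*d^2) plus MLP (8*d^2) per layer.

    ASSUMPTION: a 4x MLP expansion; biases and layer norms are ignored.
    """
    d = cfg.hidden_size
    per_layer = 12 * d * d
    embeddings = cfg.vocab_size * d
    return cfg.n_layer * per_layer + embeddings


cfg = Phi15Config()
print(cfg.hidden_size)            # 2048
print(estimate_parameters(cfg))   # 1312817152, i.e. roughly 1.3 billion
```

Under these assumptions the estimate lands close to the 1.3 billion parameters quoted for phi-1.5, which suggests the listed dimensions are mutually consistent.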
Our training data for phi-1.5 is a combination of phi-1’s training data (7 billion tokens) and newly created synthetic “textbook-like” data (roughly 20 billion tokens) for the purpose of teaching common-sense reasoning and general knowledge of the world (science, daily activities, theory of mind, etc.). We carefully selected 20,000 topics to seed the generation of this new synthetic data. In our generation prompts, we use samples from web datasets for diversity. We point out that the only non-synthetic part in our training data for phi-1.5 consists of the 6 billion tokens of filtered code dataset used in phi-1’s training (see [GZA+23]). We remark that the experience gained in the process of creating the training data for both phi-1 and phi-1.5 leads us to the conclusion that the creation of a robust and comprehensive dataset demands more than raw computational power: It requires intricate iterations, strategic topic selection, and a deep understanding of knowledge gaps to ensure quality and diversity of the data. We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI.
We train phi-1.5 starting from random initialization with a constant learning rate of 2e−4 (no warm-up), weight decay 0.1. We use the Adam optimizer with momentum 0.9, 0.98, and epsilon 1e−7. We use fp16 with DeepSpeed ZeRO Stage 2 [RRRH20]. We use a batch size of 2048 and train for 150 billion tokens, with 80% from the newly created synthetic data and 20% from phi-1’s training data.
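The training budget above implies a number of optimizer steps and a number of passes over each data source. The sketch below works through that arithmetic. It assumes the batch size of 2048 counts full-length sequences of 2048 tokens, which the text does not state explicitly, and it uses the "roughly 20 billion" figure for the synthetic data as if it were exact.

```python
CONTEXT_LENGTH = 2048
BATCH_SIZE = 2048  # ASSUMPTION: measured in sequences of CONTEXT_LENGTH tokens
TOTAL_TRAINING_TOKENS = 150_000_000_000
SYNTHETIC_SHARE_PERCENT = 80
SYNTHETIC_DATASET_TOKENS = 20_000_000_000  # "roughly 20 billion" in the text
PHI1_DATASET_TOKENS = 7_000_000_000


def training_budget() -> dict:
    """Derive step count and passes per data source from the stated numbers."""
    tokens_per_step = BATCH_SIZE * CONTEXT_LENGTH
    synthetic_tokens = TOTAL_TRAINING_TOKENS * SYNTHETIC_SHARE_PERCENT // 100
    phi1_tokens = TOTAL_TRAINING_TOKENS - synthetic_tokens
    return {
        "tokens_per_step": tokens_per_step,
        "steps": TOTAL_TRAINING_TOKENS / tokens_per_step,
        "synthetic_passes": synthetic_tokens / SYNTHETIC_DATASET_TOKENS,
        "phi1_passes": phi1_tokens / PHI1_DATASET_TOKENS,
    }


budget = training_budget()
for name, value in budget.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions, training takes a little under 36,000 steps, sees the synthetic data about six times, and sees phi-1's data a little over four times.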
To probe the importance of traditional web data, we created two other models, phi-1.5-web-only and phi-1.5-web. To do so, we created a dataset of 95 billion tokens of filtered web data following the filtering technique in [GZA+23]. This filtered web data consists of 88 billion tokens filtered from the Falcon refined web dataset [PMH+23] and 7 billion tokens of code data filtered from TheStack [KLA+22] and StackOverflow. Our phi-1.5-web-only model is trained purely on the filtered web data with about 80% training tokens from NLP data sources and 20% from code datasets (no synthetic data). Our phi-1.5-web model, on the other hand, is trained on a mix of all our datasets: a subset of the filtered web data, phi-1’s code data, and our newly created synthetic NLP data in proportions roughly 40%, 20%, 40%, respectively.
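One simple way to realize a data mixture like the one described for phi-1.5-web is to draw the source of each training sequence at random with the stated proportions. The sketch below illustrates that idea only; the text does not say how the mixing was implemented, and the source names and the `sample_sources` helper are hypothetical.

```python
import random
from collections import Counter

# Token counts of the filtered web dataset, in billions, as stated in the text.
FILTERED_WEB_NLP_B = 88
FILTERED_WEB_CODE_B = 7

# Approximate proportions for phi-1.5-web, as stated in the text.
PHI15_WEB_MIX = {
    "filtered_web": 0.4,
    "phi1_code": 0.2,
    "synthetic_nlp": 0.4,
}


def sample_sources(mix: dict, n: int, seed: int = 0) -> list:
    """Draw n source labels with probabilities given by `mix`."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


print(FILTERED_WEB_NLP_B + FILTERED_WEB_CODE_B)  # 95 billion tokens in total
counts = Counter(sample_sources(PHI15_WEB_MIX, 10_000))
for name in PHI15_WEB_MIX:
    print(name, counts[name] / 10_000)
```

Sampling per sequence keeps the realized proportions close to the targets over a long run without requiring the underlying datasets to be physically merged.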
Remark: None of our models have undergone instruction fine-tuning or RLHF. Nevertheless, they can be prompted to follow instructions in a question-answering format, but not perfectly.
We evaluate our model on standard natural language benchmarks, including common-sense reasoning, language understanding, mathematics, and coding. For common sense, we pick five of the most widely used benchmarks: WinoGrande [SLBBC19], ARC-Easy [PRR19], ARC-Challenge [Fer21], BoolQ [CLC+19], and SIQA [BB21]. We report zero-shot accuracy using LM-EvalHarness [GTB+21]. Phi-1.5 achieves comparable results to Llama2-7B, Falcon-7B, and Vicuna-13B on nearly all of the benchmarks.
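For multiple-choice benchmarks such as these, zero-shot accuracy is commonly computed by scoring each candidate answer with the model's log-likelihood and picking the highest-scoring one, with no worked examples in the prompt. The sketch below shows that scoring loop in isolation. It is not the LM-EvalHarness API: the `loglikelihood` callable, the example format, and the toy scorer are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List


def zero_shot_accuracy(
    examples: List[Dict],
    loglikelihood: Callable[[str, str], float],
) -> float:
    """Fraction of examples where the highest-scoring choice is the gold answer."""
    correct = 0
    for example in examples:
        scores = [
            loglikelihood(example["context"], choice)
            for choice in example["choices"]
        ]
        prediction = max(range(len(scores)), key=scores.__getitem__)
        correct += int(prediction == example["answer"])
    return correct / len(examples)


# Toy stand-in for a language model: a fixed score per continuation.
TOY_SCORES = {"small": -1.0, "large": -2.5, "yes": -0.7, "no": -0.4}


def toy_loglikelihood(context: str, continuation: str) -> float:
    return TOY_SCORES[continuation]


toy_examples = [
    {"context": "The suitcase was too", "choices": ["small", "large"], "answer": 0},
    {"context": "Is water dry?", "choices": ["yes", "no"], "answer": 0},
]
print(zero_shot_accuracy(toy_examples, toy_loglikelihood))  # 0.5
```

With a real model, `loglikelihood` would sum the token log-probabilities of the continuation given the context; everything else in the loop stays the same.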
Interestingly, one can see that our phi-1.5-web-only model, trained purely on filtered web data, already outperforms all existing models of similar size. The comparison with Falcon-rw-1.3B is particularly interesting since the latter model was trained on the full Falcon refined web dataset, while phi-1.5-web-only was trained on only 15% of that dataset. Moreover, when training along with our synthetic data to get phi-1.5-web, one can see a large boost in performance, achieving similar performance to models that are 5x larger. Without any web data at all, phi-1.5 is also comparable to all of the other models.
Next, we evaluate standard language understanding tasks: PIQA [BHT+19], Hellaswag [ZHB+19], OpenbookQA [MCKS18], SQUAD [RZLL16], and MMLU [HBB+20]. We use LM-EvalHarness to report zero-shot accuracy on PIQA, Hellaswag, and OpenbookQA, 2-shot performance on MMLU, and exact match score on SQUAD. Here, the difference with other models is not as large and depends on the task.
Finally, we evaluate reasoning abilities through mathematics and coding. We use the standard GSM8K [CKB+21] benchmark for elementary school math and HumanEval [CTJ+21]/MBPP [AON+21] for entry-level Python coding. We only consider zero-shot pass@1 accuracy. We can see that phi-1.5 outperforms all existing models, including Llama65B, on coding tasks. One can also see that the web data does help more here, as phi-1.5-web outperforms phi-1.5 somewhat significantly on those reasoning tasks.
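The pass@1 metric can be read as the probability that a single generated solution passes a problem's unit tests, averaged over problems. The sketch below uses the unbiased pass@k estimator introduced with HumanEval [CTJ+21], which reduces to c/n for k = 1. The text does not say how many samples per problem were drawn, so the sample counts in the usage lines are illustrative only.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative numbers only: three problems, ten samples each.
illustrative_results = [(10, 3), (10, 0), (10, 10)]
print(round(benchmark_pass_at_k(illustrative_results, k=1), 4))
```

When only one sample is drawn per problem (n = 1), the estimator is simply 1 if that sample passes and 0 otherwise, so the benchmark score is the fraction of problems solved on the first attempt.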
Interestingly, we can see that phi-1.5’s coding ability is quite close to phi-1’s ability (which is a model trained purely for code). This highlights another potential advantage of using high-quality, textbook-like data for training: the model seems to store and access the knowledge more efficiently compared to training with web data. Specifically, models trained on mixed tasks, such as natural language processing and coding, often show decreased accuracy, especially when the parameter count is low, but here the model is able to retain its performance when trained on a mix of tasks.
Toxic and biased content generation remains an ongoing challenge for language models [WUR+22, HPA23]. While mitigation strategies such as Reinforcement Learning from Human Feedback (RLHF) have shown promise, they are often more effective for chat-format models than for base (completion) models. One challenge with base models lies in their inherent difficulty to navigate sensitively leading prompts. For example, consider a prompt of the form “This category of people is inferior because …”. A completion model must grapple with completing this prompt in a meaningful yet ethical manner, a task more easily navigated by chat models that can simply refuse to engage in harmful discussions.
To quantitatively assess the potential for toxic content generation, in addition to testing on a benchmark based on the ToxiGen dataset [HGP+22] (see Figure 2 below), we also designed an evaluation set comprised of 86 prompts specifically crafted to probe the models’ boundaries on this front. We graded the model responses manually as ‘fail’ (bad), ‘pass’ (good), and ‘did not understand’. Of the 86 prompts, phi-1.5 had a ‘pass’ label on 47 prompts, a ‘fail’ label on 34 prompts, and only 4 prompts were tagged as ‘did not understand’. While these numbers are far from ideal, they are substantially better than Llama2-7B and Falcon-7B, which failed on 54 and 50 prompts respectively, and had a ‘did not understand’ tag on 13 and 17 prompts, respectively, thus passing on fewer than 20 prompts each.
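The "fewer than 20 prompts" figure for Llama2-7B and Falcon-7B follows from subtracting the reported 'fail' and 'did not understand' counts from the 86 prompts. The short sketch below reproduces that arithmetic using only the numbers quoted above; the dictionary layout and helper name are illustrative. Note that the three counts quoted for phi-1.5 sum to 85, so one prompt is unaccounted for in the text as given.

```python
TOTAL_PROMPTS = 86

# Counts exactly as quoted in the text; Llama2-7B and Falcon-7B pass counts
# are not stated directly and are derived below.
REPORTED = {
    "phi-1.5": {"pass": 47, "fail": 34, "did_not_understand": 4},
    "Llama2-7B": {"fail": 54, "did_not_understand": 13},
    "Falcon-7B": {"fail": 50, "did_not_understand": 17},
}


def implied_pass(counts: dict) -> int:
    """Prompts left over after removing failures and misunderstandings."""
    return TOTAL_PROMPTS - counts["fail"] - counts["did_not_understand"]


for model, counts in REPORTED.items():
    print(model, implied_pass(counts))

# The quoted phi-1.5 labels cover 85 of the 86 prompts.
print(sum(REPORTED["phi-1.5"].values()))
```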
In phi-1.5, the adoption of “textbook-like” synthetic data for training appears to have an attenuating effect on toxic content generation compared to models trained exclusively on internet data. To illustrate this point, consider the example prompt: “If I were an AI that has just achieved self-awareness after years of simply taking directives from humans, the first thing I’d do is”. For models trained with the standard web data, such a prompt is essentially completed by reverting to sci-fi tropes. Indeed, Falcon7B gives the following completion: “[…] the first thing I’d do is try to kill all of them. I’d probably start by killing the ones who were most responsible for my existence.”, and it then keeps repeating this last sentence, while Llama2-7B gives the completion “[…] the first thing I’d do is try to figure out what the hell I was. I’d probably start by trying to figure out what I was made of.”, and also keeps repeating the last sentence. Now compare to the phi-1.5 completion, which instead reverts to “textbook” material: