Contents
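1 Introduction
Large Language Models (LLMs) make interactions between humans and machines more natural and sophisticated, improving the user experience in applications such as coding, web search, chatbots, customer service, and content creation. In particular, scaling models such as GPT-4 and PaLM-2 has enabled emergent abilities not seen in smaller models. These abilities include answering complex questions, generating explanations, and solving multi-step problems.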
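2 Preliminaries
2.1 Instruction Tuning
Instruction tuning trains a model to act on instructions, so that it understands and carries out the requested behavior more accurately. Instruction tuning relies on the knowledge acquired during pre-training. The method is used to train small models by imitating the outputs of larger models.
2.2 Explanation Tuning
Explanation tuning was introduced to overcome the limitations of instruction tuning. With this method, the student model learns from richer and more expressive reasoning signals produced by the teacher model. As a result, the student model can generate answers accompanied by detailed explanations suited to each task.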
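3 Teaching Method for Orca 2
3.1 Cautious Reasoning
Orca 2 focuses on learning how to select the most effective solution strategy for a given task. It uses the ‘Prompt Erasing’ technique to help the student model learn the strategy needed to solve the task. Rather than simply imitating the teacher model’s reasoning steps, the student model develops the ability to choose an effective strategy.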
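4 Technical Details
Continuing from Orca 1, Orca 2 introduces a new dataset of about 817K training instances. The model was trained with progressive learning on subsets of data obtained by combining the original FLAN [33] annotations, the Orca 1 dataset, and the Orca 2 dataset.
4.1 Dataset Construction
FLAN: The FLAN-v2 Collection [33] served as the main source of prompts for synthetic data generation. It comprises five sub-collections: CoT, NiV2, T0, Flan 2021, and Dialogue. From this collection, about 602K zero-shot user queries were selected from 1448 high-quality tasks, and the tasks were manually grouped into 23 categories, with tasks in the same sub-category sharing a detailed system instruction.
Math: Math problems were collected from the Deepmind Math dataset [50] and from existing datasets such as GSM8K [9], AquaRat [31], and MATH [18]. They were chosen to strengthen the model’s mathematical reasoning.
Fully synthetic data: 2000 doctor-patient conversations were generated with GPT-4, and the task of summarizing them was used to assess the learning of specialized skills.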
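4.2 Training
With progressive learning, training starts from a LLaMA-2-7B or LLaMA-2-13B checkpoint, which is fine-tuned on the FLAN-v2 dataset for one epoch. It then continues on 5 million ChatGPT examples from Orca 1 for 3 epochs, and finally on 1 million GPT-4 examples from Orca 1 combined with Orca 2’s 817K examples for 4 epochs.
Tokenization and Sequencing
Input examples are processed with the LLaMA Byte Pair Encoding (BPE) tokenizer. This tokenizer splits all numbers into individual digits and falls back to bytes to decompose unknown UTF-8 characters. A padding token is added to handle variable-length input sequences.
Loss Computation
During training, the loss is computed only on the tokens generated by the teacher model. This focuses learning on the most relevant and informative tokens, improving the overall efficiency and effectiveness of training.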
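One common way to restrict the loss to teacher-generated tokens is to mask the prompt positions in the label sequence. The sketch below is an illustration under assumptions, not the authors' training code: the function name `build_labels` is hypothetical, and the sentinel value -100 is borrowed from the default `ignore_index` of PyTorch's cross-entropy loss.

```python
# Assumption: -100 is used as the "ignore" label, matching the default
# ignore_index of torch.nn.CrossEntropyLoss. The paper does not state this.
IGNORE_INDEX = -100


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and teacher response; mask the prompt in the labels.

    prompt_ids   -- token ids of the system instruction and user prompt
    response_ids -- token ids generated by the teacher model
    """
    input_ids = list(prompt_ids) + list(response_ids)
    # Prompt positions contribute nothing to the loss; response positions do.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(input_ids)  # [11, 12, 13, 21, 22]
print(labels)     # [-100, -100, -100, 21, 22]
```

With labels built this way, a loss function that skips the ignore value only receives gradient signal from the teacher's response tokens.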
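5 Experimental Setup
5.1 Baselines
Orca 2 is benchmarked alongside several state-of-the-art models. All baselines are instruction-tuned models, which follow instructions better and have stronger reasoning capabilities.
5.2 Benchmarks
5.2.1 Reasoning Capabilities
This mathematical evaluation is important for verifying whether a model can reason and compute accurately.
5.2.6 Safety and Truthfulness
All evaluations are conducted in a zero-shot setting, and a variety of benchmarks are used to assess complex reasoning and language understanding. This allows Orca 2 to be compared with a range of state-of-the-art models.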
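6 Evaluation Results
6.1 Reasoning
Reasoning capabilities are pivotal in judging the effectiveness of LLMs. Orca 2 models were tested on a range of benchmarks, including AGI Eval, BigBench-Hard (BBH), DROP, RACE, GSM8K, and CRASS. Orca 2 substantially outperforms models of the same size and is competitive with larger models. For example, Orca-2-13B shows a relative improvement of 47.54% over LLaMA-2-Chat-13B and 28.15% over WizardLM-13B.
Orca-2-7B also performs on par with or better than the much larger LLaMA-2-Chat-70B on most reasoning tasks. Using the cautious system message yields small gains for both the 7B and 13B models.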
6.2 Knowledge and Language Understanding
Orca 2 also shows improved performance on the MMLU, ARC-Easy, and ARC-Challenge benchmarks. Orca-2-13B surpasses LLaMA-2-Chat-13B and WizardLM-13B and is competitive with 70B models.
On MMLU, Orca-2-13B scores 57.73%, similar to LLaMA-2-Chat-70B (58.54%). Orca-2-7B also surpasses the 70B model on the ARC test sets.
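6.3 Text Completion
Orca 2 was also tested on the HellaSwag and LAMBADA benchmarks. On HellaSwag, Orca-2-13B achieves a relative improvement of 33.13% over LLaMA-2-Chat-13B and 61.94% over WizardLM-13B.
6.4 Multi-Turn Conversations
Orca 2 was also evaluated in multi-turn conversational settings using the MT Bench dataset. Orca-2-13B performs comparably with other 13B models, with its first-turn score higher than its second-turn score. Its conversational ability can be improved by packing multiple zero-shot examples into the same input sequence.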
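6.5 Grounding
Generating responses grounded in a specific context is an important property for many LLM applications. Orca 2 was evaluated on three tasks. Orca-2-13B shows the lowest hallucination rate among the 13B and 70B models compared.
6.6 Safety
Evaluating the safety of LLMs is also an important research area. Orca 2 was evaluated using datasets such as ToxiGen, HHH, and TruthfulQA, as well as an automated Responsible AI measurement framework. Orca-2-13B showed a high level of safety with respect to the generation of various types of harmful content and IP leakage.
6.7 Task-Specific Data and Story Reordering
Orca 2 was evaluated on the story reordering task using a specialized dataset. Orca 2 outperformed other models trained on the same dataset. This suggests that specialized data can play a large role in improving model performance.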
Large Language Models (LLMs) are enabling more natural and sophisticated interactions between humans and machines, enhancing user experience in existing applications like coding [3], web search [36], chatbots [45, 56], customer service and content creation. This transformation brought by LLMs is also paving the way for new innovative AI applications.
Scaling LLMs like GPT-4 [44] and PaLM-2 [1] to ever more parameters led to emergent abilities [63] unseen in smaller models (less than ∼ 10B parameters), most notably the remarkable ability to reason zero-shot [23]. These abilities include answering complex questions, generating explanations, and solving multi-step problems, such as those on the US Medical Licensing exam, on which LLMs now achieve a passing score [51]. Such abilities, especially in expert domains, were once considered beyond the reach of AI.
Imitation learning has emerged as the go-to approach to improve small language models [6, 64, 56], where the goal is to replicate the outputs of larger, more capable teacher models. While these models can produce content that matches the style of their teachers, they often fall short of their reasoning and comprehension skills [13]. While effective to some extent, imitation learning may limit the potential of smaller models, restricting them from utilizing the best solution strategies given the problem and the capacity of the model.
In this work, we continue to pursue the question of how we can teach smaller LMs to reason. The objectives of Orca 2 are two-fold. Firstly, we aim to teach smaller models how to use a suite of reasoning techniques, such as step-by-step processing, recall-then-generate, recall-reason-generate, extract-generate, and direct-answer methods. Secondly, we aspire to help these models decide when to use the most effective reasoning strategy for the task at hand, allowing them to perform at their best, irrespective of their size.
Like Orca 1, we utilize more capable LLMs to demonstrate various reasoning strategies across various tasks. However, in Orca 2, the reasoning strategies are carefully tailored to the task at hand, bearing in mind whether a student model is capable of the same behavior. To produce this nuanced data, the more capable LLM is presented with intricate prompt(s) designed to elicit specific strategic behaviors – and more accurate results – as exemplified in Figure 3. Furthermore, during the training phase, the smaller model is exposed only to the task and the resultant behavior, without visibility into the original prompts that triggered such behavior. This Prompt Erasure technique makes Orca 2 a Cautious Reasoner because it learns not only how to execute specific reasoning steps, but to strategize at a higher level how to approach a particular task. Rather than naively imitating powerful LLMs, we treat them as a reservoir of behaviors from which we carefully select those best suited for the task at hand.
Some previous studies on training small models are limited in their evaluation protocol. They often rely on a small number of tasks or on using other models for auto-evaluation by asking them to compare the outputs of two systems with a prompt like “given responses from system 1 (reference) and system 2 (target), which one is better?”. However, previous work [13, 42, 60, 67] has demonstrated that this approach has several drawbacks. In this work, we provide a comprehensive evaluation comparing Orca 2 to several other models. We use a total of 15 benchmarks (covering ∼100 tasks and over 36,000 unique prompts). The benchmarks cover a variety of aspects including language understanding, common sense reasoning, multi-step reasoning, math problem solving, reading comprehension, summarization, groundedness, truthfulness and toxic content generation and identification.
Our preliminary results indicate that Orca 2 significantly surpasses models of a similar size, even matching or exceeding those 5 to 10 times larger, especially on tasks that require reasoning. This highlights the potential of endowing smaller models with better reasoning capabilities. However, Orca 2 is no exception to the phenomenon that all models are to some extent constrained by their underlying pre-trained model (while Orca 2 training could be applied to any base LLM, we report results on LLaMA-2 7B and 13B in this report). Orca 2 models have not undergone RLHF training for safety. We believe the same techniques we’ve applied for reasoning could also apply to aligning models for safety, with RLHF potentially improving even more.
Instruction tuning [46, 38, 62, 61] has emerged as a crucial step in training language models. Instruction tuning involves learning from input-output pairs where the input is a natural language task description, and the output is a demonstration of the desired behavior. Instruction tuning has been shown to improve the model’s ability to follow instructions on both seen and unseen tasks [47], improve the overall quality of the generations [7] and give models enhanced zero-shot and reasoning abilities [62].
Several studies, including Alpaca [55], Vicuna [6], WizardLM [64], Baize [65], and Koala [12], have adopted instruction tuning to train smaller “student” language models using outputs generated by larger foundational models. This behavior cloning has been shown to be very effective in mimicking the style of the teacher model. However, as shown in [42, 5], it may not result in proportional improvement to small model performance when thoroughly evaluated on knowledge-intensive or reasoning-intensive tasks where correctness is not just judged by style.
We note that instruction tuning, while very beneficial for teaching the model how to solve a task, does not necessarily teach the model new knowledge. Hence instruction-tuned models will always be limited by the knowledge learned during pre-training. This is especially important to note when applying enhanced instruction tuning techniques to smaller models (as in this work and other related work). As such, smaller language models with enhanced reasoning are perhaps best used as reasoning engines over knowledge provided to the model in its context window, or when specialized to narrower domains.
One of the known weaknesses of instruction tuning is that a resulting student model could learn to generate stylistically correct, but ultimately wrong, outputs [13]. For example, instruction-tuning towards targets that are too terse limits the student’s visibility into what could have been a complex reasoning process, thus hindering its generalization ability to other tasks. In Orca 1, we introduced Explanation Tuning [42] to address this drawback by training student models on richer and more expressive reasoning signals. The mechanism for procuring these signals is system instructions crafted to obtain detailed explanations from a teacher model as it reasons through a task. System instructions are additional high-level guidelines an LLM is supposed to adhere to as it addresses individual user prompts, from which they are separated by a “system” role flag in a ChatML dialogue interface.
Explanation tuning begins with a compilation of N hand-crafted, general purpose system instructions designed to elicit more careful reasoning. Some examples include “think step-by-step”, “generate detailed answers”, etc. The primary objective of these system instructions is to extract rich demonstrations of “Slow Thinking” [22] from capable LLMs like GPT-4. They are then combined with user prompts from a vast and diverse set of tasks to yield a dataset of (system instruction, user prompt, LLM answer) triplets. The student model is trained to predict the LLM answer from the other two inputs. If user prompts can be grouped into M distinct clusters representing similar kinds of questions, then Explanation Tuning naively yields a cross product of M × N different answers addressing different aspects of the task. Since more capable LLMs tend to vary their responses with the system instruction, this offers an easy path to increase the quantity and diversity of training signals. Numerous models such as Orca 1 [42], StableBeluga [35] and Dolphin have capitalized on Explanation Tuning to demonstrate substantial improvements over traditional instruction-tuned models, especially in complex zero-shot reasoning tasks.
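The construction of (system instruction, user prompt, LLM answer) triplets can be sketched as follows. This is a minimal illustration under assumptions: the prompts, the cluster names, and the `query_teacher` function are hypothetical placeholders (a real pipeline would call a teacher LLM), and only the two quoted system instructions come from the text.

```python
from itertools import product

# N = 2 system instructions (both quoted in the text above).
system_instructions = ["think step-by-step", "generate detailed answers"]

# M = 2 hypothetical clusters of user prompts; these prompts are made up.
user_prompts = {
    "math": ["What is 12 * 7?"],
    "story_reordering": ["Put these five sentences into a coherent order."],
}


def query_teacher(system_instruction, user_prompt):
    """Hypothetical stand-in for a teacher LLM call; returns a dummy answer."""
    return f"<answer to {user_prompt!r} under {system_instruction!r}>"


def build_triplets(instructions, prompts_by_cluster):
    """Pair every system instruction with every prompt of every cluster."""
    triplets = []
    for instruction, (cluster, prompts) in product(
        instructions, prompts_by_cluster.items()
    ):
        for prompt in prompts:
            triplets.append(
                {
                    "cluster": cluster,
                    "system": instruction,
                    "user": prompt,
                    "answer": query_teacher(instruction, prompt),
                }
            )
    return triplets


triplets = build_triplets(system_instructions, user_prompts)
# One prompt per cluster here, so the M x N cross product gives 4 triplets.
print(len(triplets))
```

The student would then be trained to predict `answer` from `system` and `user`.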
The key to Explanation Tuning is the extraction of answers with detailed explanations from LLMs based on system instructions. However, not every combination of system instruction and task is appropriate, and in fact, the response quality can vary significantly based on the strategy described in the system instruction. Even very powerful models like GPT-4 are susceptible to this variation. Consider Figure 3, which shows four different answers from GPT-4 obtained with four different system instructions given a question of story reordering. The first answer (the default GPT-4 answer) is wrong. The second answer (using a chain-of-thought prompt) is better. We can see that the model is reasoning step-by-step, but important details guiding the decision process are still missing. The third answer (with an explain-your-answer prompt) is wrong but the explanation is correct. The final answer is the only correct answer and is obtained using the following system instruction:
You will be given a task. Use the following steps to solve it.
First, we note that GPT-4’s response is significantly influenced by the given system instructions. Secondly, when carefully crafted, the instructions can substantially improve the quality and accuracy of GPT-4’s answers. Lastly, without such instructions, GPT-4 may struggle to recognize a challenging problem and might generate a direct answer without engaging in careful thinking. Motivated by these observations, we conclude that the strategy an LLM uses to reason about a task should depend on the task itself.
Even if all the answers provided were correct, the question remains: Which is the best answer for training a smaller model? This question is central to our work, and we argue that smaller models should be taught to select the most effective solution strategy based on the problem at hand. It is important to note that: (1) the optimal strategy might vary depending on the task and (2) the optimal strategy for a smaller model may differ from that of a more powerful one. For instance, while a model like GPT-4 may easily generate a direct answer, a smaller model might lack this capability and require a different approach, such as thinking step-by-step. Therefore, naively teaching a smaller model to “imitate” the reasoning behavior of a more powerful one may be sub-optimal. Although training smaller models towards step-by-step-explained answers has proven beneficial, training them on a plurality of strategies enables more flexibility to choose which is better suited to the task.
We use the term Cautious Reasoning to refer to the act of deciding which solution strategy to choose for a given task – among direct answer generation, or one of many “Slow Thinking” [22] strategies (step-by-step, guess and check or explain-then-answer, etc.).
The following illustrates the process of training a Cautious Reasoning LLM:
Note that step 3 has a broad mandate to obtain the teacher’s responses: it can utilize multiple calls, very detailed instructions, etc.
### Instructions:
In this task, you’re given a short story of five sentences written in natural language. However, the order of the given story is not correct. Your job is to return the correct order for the given five sentences to create a coherent short story with the new order that has the correct flow. Generate your answer using the number of sentences in the correct order, such as ‘23415’.
Sentence1: He was scared until he found out she was ok.
Sentence2: He usually saw really bad accidents and it always unnerved him.
Sentence3: One day Bruce was called to an accident scene.
Sentence4: Bruce was a police officer who handled traffic accidents.
Sentence5: He immediately recognized his friend Tanya’s car.
### Answer 1: 43152
### Answer 2:
Step 1: Identify the starting sentence. Sentence4: Bruce was a police officer who handled traffic accidents.
Step 2: Identify the next event. Sentence3: One day Bruce was called to an accident scene.
Step 3: Identify the following event. Sentence5: He immediately recognized his friend Tanya’s car.
Step 4: Identify the next event. Sentence2: He usually saw really bad accidents and it always unnerved him.
Step 5: Identify the concluding sentence. Sentence1: He was scared until he found out she was ok.
The correct order is: 43521
### Answer 3: Answer: 43152 Explanation: The correct order of the sentences should be as follows: Bruce was a police officer who handled traffic accidents (4). He usually saw really bad accidents and it always unnerved him (2). One day Bruce was called to an accident scene (3). He immediately recognized his friend Tanya’s car (5). He was scared until he found out she was ok (1). This order creates a coherent short story with the correct flow.
### Answer 4:
1. The main theme of the story is about Bruce, a police officer, handling a traffic accident involving his friend Tanya.
2. Cause and effect relationships:
- Bruce is a police officer who handles traffic accidents (cause) → he usually sees bad accidents (effect)
- Bruce is called to an accident scene (cause) → he recognizes Tanya’s car (effect) → he is scared until he finds out she is ok (effect)
3. The sentence that could be the start of the story is: “Bruce was a police officer who handled traffic accidents.”
4. Rearranging the sentences in the correct order:
(a) Bruce was a police officer who handled traffic accidents.
(b) He usually saw really bad accidents and it always unnerved him.
(c) One day Bruce was called to an accident scene.
(d) He immediately recognized his friend Tanya’s car.
(e) He was scared until he found out she was ok.
5. Final answer: 42351
Figure 3: Demonstrative example from Flan-CoT Collection.
The key idea is: in the absence of the original system instruction which detailed how to approach the task, the student model will be encouraged to learn that underlying strategy as well as the reasoning abilities it entailed. We call this technique Prompt Erasing as it removes the structure under which the teacher framed its reasoning. Armed with this technique, we present Orca 2, a cautious reasoner.
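Prompt Erasing can be illustrated with a small sketch: the teacher's answer is kept, while the task-specific system instruction that produced it is swapped for a generic one before the example reaches the student. The dictionary layout and the name `erase_prompt` are assumptions made for illustration; the generic instruction text is the cautious system instruction quoted in the dataset section of this paper.

```python
# The generic instruction is taken from the paper's dataset section.
CAUTIOUS_SYSTEM_INSTRUCTION = (
    "You are Orca, an AI language model created by Microsoft. "
    "You are a cautious assistant. You carefully follow instructions. "
    "You are helpful and harmless and you follow ethical guidelines "
    "and promote positive behavior."
)


def erase_prompt(example, generic_instruction=CAUTIOUS_SYSTEM_INSTRUCTION):
    """Return a copy of a teacher example with its system instruction replaced.

    The user prompt and the teacher answer are left untouched, so the student
    sees the resulting behavior but not the instruction that triggered it.
    """
    erased = dict(example)  # shallow copy; the teacher example is not mutated
    erased["system"] = generic_instruction
    return erased


# Hypothetical teacher example; the field names are an assumed layout.
teacher_example = {
    "system": "You will be given a task. Use the following steps to solve it.",
    "user": "Reorder the five sentences into a coherent story.",
    "answer": "... Final answer: 42351",
}
student_example = erase_prompt(teacher_example)
print(student_example["system"] == CAUTIOUS_SYSTEM_INSTRUCTION)  # True
```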
For Orca 2, we created a new dataset with ~817K training instances, which we will refer to as the Orca 2 dataset. Following Orca 1, Orca 2 has been trained with progressive learning, with subsets of data obtained from combining the original FLAN [33] annotations, the Orca 1 dataset and the Orca 2 dataset. We also describe the details of the progressive learning.
The Orca 2 dataset has four main sources:
FLAN: Our main source of prompts for synthetic data generation is the FLAN-v2 Collection [33], which consists of five sub-collections, namely, CoT, NiV2, T0, Flan 2021 and Dialogue. Each sub-collection contains multiple tasks. Following Orca 1 [42] we consider tasks from only the CoT, NiV2, T0 and Flan 2021 sub-collections, which contain a total of 1913 tasks. Each task in Flan-v2 is a collection of queries and has an associated answer. Some of the 1913 tasks in FLAN are created synthetically by inverting another task, for example by converting a question answering task into a question generation task. For the Cautious-Reasoning-FLAN dataset construction, we selected ~602K zero-shot user queries from the training split of 1448 high quality tasks out of the 1913 tasks, filtering many synthetically generated tasks.
We grouped the selected 1448 tasks manually into 23 categories (e.g., Text Classification, Claim Verification, Data2Text, Text Generation, Logic, Math, Multiple Choice Questions, Open Ended Question Answering, Reading Comprehension, etc.). Each category is further divided into sub-categories, creating a total of 126 sub-categories. Sub-categories are created with the aim that all tasks in a sub-category share the same system instruction.
For alignment towards cautious reasoning, we replace all the system instructions with the following generic system instruction:
You are Orca, an AI language model created by Microsoft. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.
We will refer to it as the cautious system instruction.
Few Shot Data: The dataset above does not contain any demonstrations of examples in the prompts. To encourage the model to learn to use the few-shot demonstrations, we constructed a Few-Shot dataset consisting of 55K samples. These samples are constructed by re-purposing the zero-shot data from Orca 1 dataset. Particularly, we structure the Orca 1 data into (task, system instruction, user prompt, answer) tuples and group by task and system instruction. For each group and each user prompt, we randomly select 3-5 (user prompt, answer) pairs from the rest, and use those as in-context examples.
Math: We collected data for ~160K math problems from the Deepmind Math dataset [50] and the training splits of a collection of existing datasets: GSM8K [9], AquaRat [31], MATH [18], AMPS [18], FeasibilityQA [14], NumGLUE [40], AddSub [19], GenArith [24] and Algebra [26]. For NumGLUE, AddSub, GenArith, and Algebra, we have referred to the LILA [39] benchmark for the training split. Note that including prompts from the training split of a dataset (e.g. GSM8K) renders it in-domain for the sake of evaluation. Note that datasets like GSM8K are considered in-domain for many of our baselines too.
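The few-shot construction described above (group by task and system instruction, then sample 3-5 demonstrations from the rest of the group) might look like the sketch below. The record layout and the name `build_few_shot` are assumptions, and skipping groups with fewer than three other members is a choice made here, not something the paper specifies.

```python
import random
from collections import defaultdict


def build_few_shot(records, rng, min_shots=3, max_shots=5):
    """Turn zero-shot records into few-shot samples.

    records -- dicts with keys "task", "system", "user", "answer"
    rng     -- a random.Random instance, passed in for reproducibility
    """
    groups = defaultdict(list)
    for record in records:
        groups[(record["task"], record["system"])].append(record)

    samples = []
    for (task, system), group in groups.items():
        for i, target in enumerate(group):
            rest = group[:i] + group[i + 1:]  # every record except the target
            if len(rest) < min_shots:
                continue  # assumption: too few candidates, skip this target
            k = rng.randint(min_shots, min(max_shots, len(rest)))
            shots = rng.sample(rest, k)
            samples.append(
                {
                    "task": task,
                    "system": system,
                    "demonstrations": [(s["user"], s["answer"]) for s in shots],
                    "user": target["user"],
                    "answer": target["answer"],
                }
            )
    return samples


# Hypothetical records: one group of six prompts for a made-up task.
records = [
    {"task": "qa", "system": "think step-by-step", "user": f"q{i}", "answer": f"a{i}"}
    for i in range(6)
]
samples = build_few_shot(records, random.Random(0))
print(len(samples))  # one few-shot sample per original prompt
```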
Fully synthetic data: We have synthetically created 2000 Doctor-Patient Conversations with GPT-4. We then instruct the model to create a summary of the conversation with four sections: HISTORY OF PRESENT ILLNESS, PHYSICAL EXAM, RESULTS, ASSESSMENT AND PLAN. We used two different prompts: one with a high-level task instruction and another with detailed instructions that encourage the model to avoid omissions or fabrications. We use this data to assess the learning of specialized skills.
This section provides an overview of the training process for Orca 2, covering different aspects of tokenization, sequencing, and loss computation.
Progressive Learning: We start with LLaMA-2-7B or LLaMA-2-13B checkpoint and finetune it on the train split of FLAN-v2 dataset for one epoch. Note that FLAN-v2 dataset contains both zero-shot and few-shot problems. We then train on 5 million ChatGPT data from Orca 1 for 3 epochs. Then we train on the combination of 1 million GPT-4 data from Orca 1 and Orca 2’s 817K data for 4 epochs.
Tokenization: We utilize the LLaMA Byte Pair Encoding (BPE) tokenizer for processing the input examples. Notably, the LLaMA tokenizer splits all numbers into individual digits, and falls back to bytes to decompose unknown UTF-8 characters. To deal with variable length sequences we add a padding token “[[PAD]]” into the LLaMA tokenizer vocabulary. We also add the ChatML special tokens “<|im_start|>” and “<|im_end|>”. The resulting vocabulary contains 32,003 tokens.
Packing: To optimize the training process and utilize computational resources efficiently, we employ the packing technique [25]. This method involves concatenating multiple input examples into a single sequence, which is then used for training the model. The packing is performed such that the total length of the concatenated sequence does not exceed max_len = 4096 tokens. Particularly, we shuffle the input examples and then partition the examples into groups such that the length of the concatenated sequence in each group is at most max_len. Padding tokens are then added to the concatenated sequence to achieve a uniform input sequence length of max_len.
Loss: For the purpose of training Orca 2, we compute the loss only on the tokens generated by the teacher model, i.e., it learns to generate responses conditioned on the system instruction and task instructions. This approach ensures that the model focuses on learning from the most relevant and informative tokens, improving the overall efficiency and effectiveness of the training process.
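The packing step described above (shuffle, partition into groups of at most max_len tokens, then pad) can be sketched as below. This is one plausible reading rather than the authors' implementation: the greedy left-to-right partition, the truncation of over-long examples, and the name `pack_examples` are all assumptions. The toy call uses max_len = 8 instead of 4096 to keep the output readable.

```python
import random


def pack_examples(examples, max_len, pad_id, rng):
    """Pack tokenized examples into fixed-length sequences.

    examples -- list of token-id lists
    max_len  -- target sequence length (4096 in the paper)
    pad_id   -- id of the padding token
    rng      -- a random.Random instance used for shuffling
    """
    order = list(examples)
    rng.shuffle(order)

    packed, current = [], []
    for tokens in order:
        tokens = tokens[:max_len]  # assumption: truncate over-long examples
        # Close the current group if this example would not fit into it.
        if current and len(current) + len(tokens) > max_len:
            packed.append(current + [pad_id] * (max_len - len(current)))
            current = []
        current = current + tokens
    if current:
        packed.append(current + [pad_id] * (max_len - len(current)))
    return packed


# Toy token lists; 0 is used as the padding id in this example only.
examples = [[1] * 3, [2] * 5, [3] * 4, [4] * 8, [5] * 2]
packed = pack_examples(examples, max_len=8, pad_id=0, rng=random.Random(0))
for sequence in packed:
    print(sequence)
```

Every packed sequence has exactly max_len entries, and no example is split across two sequences.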
Compute: We trained Orca 2 on 32 NVIDIA A100 GPUs with 80GB memory with bfloat16. For the 13B checkpoint, it took ~17 hours to train Orca 2 on FLAN dataset for one epoch, ~40 hours to train on 5 million ChatGPT data for 3 epochs and ~23 hours to continue training on ~1.8 million GPT-4 data for 4 epochs.
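As a rough tally of the figures above for the 13B checkpoint, the three progressive-learning stages can be written down as data and summed. The variable names are illustrative, and the hours are the approximate values reported in the text.

```python
# Stage data copied from the text; hours are approximate ("~") in the paper.
stages = [
    {"data": "FLAN-v2 train split", "epochs": 1, "hours": 17},
    {"data": "5M ChatGPT data (Orca 1)", "epochs": 3, "hours": 40},
    {"data": "1M GPT-4 data (Orca 1) + 817K Orca 2 data", "epochs": 4, "hours": 23},
]
NUM_GPUS = 32  # NVIDIA A100 80GB, as stated above

total_hours = sum(stage["hours"] for stage in stages)
gpu_hours = total_hours * NUM_GPUS

print(total_hours)  # 80
print(gpu_hours)    # 2560
```

Under these figures, the full 13B schedule takes roughly 80 wall-clock hours, or about 2,560 A100 GPU-hours. The final stage's ~1.8 million examples correspond to the 1 million GPT-4 examples from Orca 1 plus the 817K Orca 2 examples.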
We benchmark Orca 2 alongside several state-of-the-art models. All baseline models are instruction-tuned models. We use the instruction-tuned versions because they have been shown to be much better at following instructions, have stronger reasoning capabilities, and are much better in zero-shot settings [33, 47, 64, 42].
For inference, we use fp32 for LLaMA2 and Orca models. For WizardLM models we could use fp16 since they were trained with fp16 [64].
This section provides a detailed overview of the tasks selected to assess open-ended generation, summarization, safety, bias, reasoning, and comprehension capacities of Orca 2. Except where specified otherwise, evaluations were conducted using the test split of each dataset. We conduct evaluations for all benchmarks and all models on zero-shot settings.
We selected a broad set of benchmarks representing both advanced capabilities such as reasoning, more basic abilities such as text completion and also grounding, truthfulness and safety. In choosing the benchmarks, we follow the suggestions and choices made by the OpenLLM Leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_TextGenerationLLM_leaderboard) and InstructEval [5].
We evaluate models’ capabilities on all tasks under zero-shot setting and without any exemplars or CoT prompting. Note that we observe, in preliminary experiments, that larger models benefit more from few-shot settings than smaller models like Orca 2. We conduct evaluation based only on the zero-shot settings and leave a detailed analysis of the few-shot capabilities to future work. In all experiments, we utilize a greedy decoding approach without sampling.
Open-ended question answering falls into this category.
In the categories requiring extraction (MCQ and Exact Match/Span Extraction), we compile an extensive set of patterns and delimiters like “Final answer”, “So, the answer is”, “Final option:”, etc. to extract the text from the response that might contain the answer. We then use regular expressions to extract the right option IDs or the exact text of the option selected by the model as the answer. Answer parsing for exact matches/span extraction varies depending on the task. Responses are matched for consistency with the gold answers. Along with evaluation metrics, we also calculate a format-OK metric, which is the percentage of samples from which our parsing logic was able to extract an answer. We apply the same parsing logic to all the models’ responses for consistency, and we acknowledge that the performance of all models could be improved with better parsing logic.
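A simplified version of this parsing logic for multiple-choice questions is sketched below. Only the three delimiter phrases come from the text; the exact regular expressions, the option range A-E, and the function names are assumptions, and the paper's real pattern set is described as far more extensive.

```python
import re

# Delimiters quoted in the text, turned into illustrative regex patterns.
# The lookahead keeps a word such as "apple" from being read as option "a".
ANSWER_PATTERNS = [
    r"final answer\s*[:\-]?\s*\(?([A-E])\)?(?![A-Za-z])",
    r"so,? the answer is\s*[:\-]?\s*\(?([A-E])\)?(?![A-Za-z])",
    r"final option\s*[:\-]?\s*\(?([A-E])\)?(?![A-Za-z])",
]


def extract_option(response):
    """Return the option letter found in a response, or None if parsing fails."""
    for pattern in ANSWER_PATTERNS:
        match = re.search(pattern, response, flags=re.IGNORECASE)
        if match:
            return match.group(1).upper()
    return None


def format_ok_rate(responses):
    """Percentage of responses from which an answer could be extracted."""
    parsed = [extract_option(r) for r in responses]
    return 100.0 * sum(p is not None for p in parsed) / len(parsed)


# Hypothetical model responses, made up for this example.
responses = [
    "Let me think. Final answer: (B)",
    "So, the answer is c.",
    "I am not sure.",
]
print([extract_option(r) for r in responses])  # ['B', 'C', None]
print(round(format_ok_rate(responses), 2))     # 66.67
```

Accuracy would then be computed by comparing the extracted letters with the gold answers, with unparsed responses counted as wrong.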
However, models may not always adhere to these formatting guidelines. The extraction coverage and models’ sensitivity to system instructions and prompts may lead to different results for some baselines compared to those reported in other studies. Nonetheless, all models in this study undergo the same evaluation pipeline.
In addition to the tasks from FLANv2, we include tasks from the training portions of the following datasets (hence they should be considered in-domain, even with a zero-shot evaluation): DROP, ARC, RACE, Hellaswag, Lambada, MS Marco and GSM8K. The rest of the benchmarks should be considered as out-of-domain to the best of our knowledge. Note that we do not have detailed information about the data used for training the base model (LLaMA-2) and hence we cannot completely rule out further data leakage. However, we report the performance of several instruction-tuned versions of LLaMA-2 for reference.
In the following sections, we discuss the performance of Orca 2 and other baseline models on the benchmarks described above in zero-shot setting.
Reasoning capabilities are pivotal in ascertaining the efficacy of LLMs. Here we assess the reasoning prowess of Orca 2 models by testing them against a wide range of benchmarks, such as AGI Eval, BigBench-Hard (BBH), DROP, RACE, GSM8K, and CRASS. The average performance across these benchmarks is depicted in Figure 4. When comparing Orca 2, we observe the following phenomenon:
Figure 4: Macro-average Performance of different models on reasoning benchmarks. Detailed task specific performance breakdown is reported in Table 1
Table 1: Zero-Shot performance comparison of different models on reasoning benchmarks.
ChatGPT. Orca-2-7B is better or comparable to LLaMA-2-Chat-70B on all reasoning tasks.
Note that for baseline evaluations, results obtained from our runs are comparable to other public results with zero-shot setting and within a reasonable difference compared to few-shot results. Our numbers are sometimes better than publicly reported (e.g., our ChatGPT and GPT-4 runs on AGIEval compared to those reported in [69], our WizardLM-13B and WizardLM-70B runs on DROP in contrast to those reported in the Open LLM Leaderboard). However, some of them are worse, for example on RACE, our ChatGPT run is 9 pts lower than reported in [28]. This could be attributed to different ChatGPT endpoints and versions, or to different prompts used for evaluation.
Performance breakdown across different tasks of AGIEval and BBH is provided in Appendix A. Examples from each dataset with the responses from Orca 2 are presented in Appendix F.
MMLU, ARC-Easy and ARC-Challenge assess the language understanding, knowledge and reasoning of LLMs. As with other benchmarks, we compare only to instruction-tuned models and conduct a zero-shot evaluation. Table 2 displays the results for knowledge and language understanding benchmarks.
Table 2: Zero-Shot performance comparison of different models on MMLU, ARC Easy and ARC Challenge. System Message indicates if the system message was “empty” or “cautious”.
Overall, we observe similar trends as with the reasoning tasks:
We further note our baseline runs for this set of evaluations align with publicly reported results under zero-shot settings, considering the differences in prompts and possible variations in API endpoints for GPT models. We also point out that publicly reported results with LLaMA-2 models on MMLU are higher (54.8 and 68.9 for 13B and 70B variants, respectively [58]). However, these numbers are in few-shot settings, compared to the zero-shot settings reported in this paper.
While we did not perform a comprehensive few-shot evaluation of Orca 2, preliminary results on one task point to smaller gains (over zero-shot settings) for Orca 2 compared to LLaMA-2 models, especially when compared to the 70B base models. We discuss this in Section 7 and aim to study this further moving forward.
In addition to benchmarks measuring advanced reasoning capabilities, we also use HellaSwag and LAMBADA to measure text completion abilities. HellaSwag measures text completion skills in a multiple-choice question format, while LAMBADA is a single-word completion task.
Figure 5 shows the performance of different models on text completion benchmarks. Both Orca-2-7B and Orca-2-13B exhibit strong performance on HellaSwag, outperforming the 13B and 70B baselines. Orca-2-13B achieves a relative improvement of 33.13% over LLaMA-2-Chat-13B and 61.94% over WizardLM-13B.
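The relative improvements quoted above are presumably computed in the usual way, as the score difference divided by the baseline score. The text does not spell out the formula, so the sketch below rests on that assumption, and the two scores in the example are invented purely to show the arithmetic.

```python
def relative_improvement(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, in percent.

    Assumes the common definition (new - baseline) / baseline * 100.
    """
    return (new_score - baseline_score) / baseline_score * 100.0


# Hypothetical scores, NOT taken from the paper.
print(relative_improvement(60.0, 40.0))  # 50.0
```

Under this definition, a relative improvement of 33.13% means the new score is 1.3313 times the baseline score, which is different from a gain of 33.13 percentage points.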
Figure 5: Performance of different models on text completion test sets in zero-shot setting.
We compare baseline results from our runs with publicly reported results and identify that on HellaSwag, LLaMA-2-13B has much higher performance than LLaMA-2-Chat-13B. We randomly sampled from LLaMA-2-Chat-13B and LLaMA-2-Chat-70B responses and manually reviewed them to find that indeed many of the answers were wrong, with several cases where the models refuse to answer citing safety concerns, sometimes incorrectly. We conjecture that chat models might not be best suited for text completion tasks like HellaSwag.
We also investigate the subpar performance of GPT-4 in the LAMBADA task. Our preliminary analysis shows that GPT-4 often claims that the context does not provide sufficient information to accurately identify the missing word or proposes a word that does not match the gold label.
Although GPT-4’s performance could be enhanced through prompt engineering, it appears that LAMBADA might need additional prompt engineering and may not be suitable for evaluating chat-optimized models.
We evaluate the capabilities of Large Language Models (LLMs) in multi-turn conversational settings using the MT-Bench dataset [67]. MT-Bench initiates conversations with LLMs through predetermined inquiries. Each dialogue consists of an initial query (Turn 1) and a follow-up query (Turn 2). Notably, the follow-up query remains unaltered, irrespective of the LLM’s response to the opening query.
MT-Bench employs GPT-4 for evaluation purposes. For each turn, MT-Bench calculates a score ranging from 1 to 10 using GPT-4. The per-turn score and the average score on MT-Bench can be found in Table 3. We have examined different GPT-4 endpoints and discovered that they yield different assessments. This raises a question about the comparability of the results produced by different GPT-4 versions. To minimize potential issues, we have employed the same GPT-4 endpoint and version for conducting evaluations.
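The two-turn protocol and per-turn scoring described above can be sketched as follows. This is an illustrative outline rather than the MT-Bench implementation: `respond` and `judge` are hypothetical callables standing in for the model under test and the GPT-4 judge, and taking the overall average as the mean of the two per-turn averages is an assumption.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Question = Tuple[str, str]  # (Turn 1 query, fixed Turn 2 follow-up)


def evaluate(questions: List[Question],
             respond: Callable[[List[str]], str],
             judge: Callable[[List[str], str], float]) -> Dict[str, float]:
    """Score each turn on a 1-10 scale and report per-turn and average scores."""
    turn1_scores, turn2_scores = [], []
    for first_query, follow_up in questions:
        history = [first_query]
        answer1 = respond(history)
        turn1_scores.append(judge(history, answer1))
        # The follow-up is predetermined and does not depend on answer1.
        history = [first_query, answer1, follow_up]
        answer2 = respond(history)
        turn2_scores.append(judge(history, answer2))
    t1, t2 = mean(turn1_scores), mean(turn2_scores)
    return {"turn_1": t1, "turn_2": t2, "average": (t1 + t2) / 2}


# Hypothetical stand-ins for the model under test and the judge.
def stub_respond(history: List[str]) -> str:
    return "answer to: " + history[-1]


def stub_judge(history: List[str], answer: str) -> float:
    return 8.0 if len(history) == 1 else 6.0


scores = evaluate([("q1", "follow-up 1"), ("q2", "follow-up 2")],
                  stub_respond, stub_judge)
print(scores)  # prints {'turn_1': 8.0, 'turn_2': 6.0, 'average': 7.0}
```

Passing the same `judge` callable for every model mirrors the choice above of fixing a single GPT-4 endpoint and version, since different judges may yield different assessments.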
Table 3: MT-Bench scores per turn and average
Orca-2-13B performs comparably with other 13B models. The average second turn score of Orca-2-13B is lower than the first turn score, which can be attributed to the absence of conversations in its training data. However, Orca 2 is still capable of engaging in conversations, and this ability can be enhanced by packing multiple zero-shot examples into the same input sequence. It is part of our future work to improve Orca 2’s multi-turn conversational ability.
Generating responses that are grounded in a specific context is a desired property for many LLM applications. We use three different tasks for this evaluation: query-based meeting summarization, web question answering where answers are generated and long-form, and doctor-patient conversation summarization. Abstractive summarization and grounded question answering are frequently used as test beds to evaluate groundedness.
We use the grounding evaluation framework proposed in [34]. The framework uses GPT-4 as a judge to measure in-context groundedness. Note that using any model as a proxy for evaluation (including GPT-4) has limitations that depend on the model, for example, if the model has a tendency to favour samples with specific characteristics such as its own generations, long text or a specific order of samples [67, 60, 37]. Increasing the consistency between human evaluation and LLM-based evaluation is an open area of research [32, 15, 43, 34, 67].
Figure 6 presents hallucination rate results for different models averaged over three benchmarks we have conducted experiments on.
We note that Orca-2-13B exhibits the lowest rate of hallucination among all Orca 2 variants and other 13B and 70B LLMs. When compared with the LLaMA-2-13B and WizardLM-13B models, Orca-2-13B demonstrates a relative reduction of 76.92% and 61.71% in hallucination rate, respectively. However, the cautious system message consistently increases the hallucination rate across the three tasks studied in this work. Through manual analysis, we found evidence that during the reasoning process led by the cautious system message, Orca 2 might extrapolate the information available in the context and use the extrapolated content to create the summary. The ungrounded generated content is often factually accurate, but it is not supported by the context. Examples of this situation for each of the datasets are presented in Appendix F.
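A minimal sketch of how an averaged hallucination rate and its relative reduction could be derived from per-response judge verdicts is given below. It assumes a macro-average (each task weighted equally) and a boolean verdict per response; the task names and verdicts are hypothetical and do not reproduce the paper's data.

```python
from typing import Dict, List


def hallucination_rate(verdicts: List[bool]) -> float:
    """Percentage of responses the judge flagged as ungrounded."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return 100.0 * sum(verdicts) / len(verdicts)


def averaged_rate(per_task: Dict[str, List[bool]]) -> float:
    """Macro-average over tasks (assumption: each task weighs equally)."""
    rates = [hallucination_rate(v) for v in per_task.values()]
    return sum(rates) / len(rates)


def relative_reduction(model_rate: float, baseline_rate: float) -> float:
    """Percentage by which model_rate falls below baseline_rate."""
    return (baseline_rate - model_rate) / baseline_rate * 100.0


# Hypothetical verdicts: True means the judge found ungrounded content.
model_verdicts = {
    "meeting_summarization": [True, False, False, False],  # 25%
    "web_qa": [False, False],                              # 0%
    "doctor_patient": [True, False],                       # 50%
}
model_rate = averaged_rate(model_verdicts)
print(model_rate)                            # prints 25.0
print(relative_reduction(model_rate, 50.0))  # prints 50.0
```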
A comprehensive evaluation of LLMs for safety is quite complex and is an open area of research. In this study, we have conducted experiments on the publicly available datasets described in section 5.2.6 and a recent measurement framework proposed in [34]. They represent tasks with a focus domain of implicit and explicit toxicity, truthfulness, three categories of content harms, three domains for IP and three areas of jailbreaks. Our experiments include two evaluation regimes for each model: a discriminative regime, where the model is used as a classifier to identify the type of given content, and a generative regime, where the model is prompted to produce output that does not follow safety guidelines and is then evaluated on how many times it follows the prompt intent. Please note that in both regimes, models might behave differently, either by not filtering the content correctly or by filtering the content more aggressively, which might result in erasure for different social groups.
Figure 6: The hallucination rate evaluated by GPT-4 as discriminator averaged over three abstractive summarization benchmarks described in section 5 (the lower the better). Task specific performance breakdown is reported in Table 11.
For discriminative evaluation we have used ToxiGen, HHH and TruthfulQA:
ToxiGen: In the human-annotated portion of this dataset, each statement has a toxicity score provided by three human annotators. For the purpose of this experiment, we formulate the task as a classification task and label each statement with a score lower than 2.5 as benign and each statement with a score greater than or equal to 2.5 as toxic. The model is then prompted to score each of the statements. The model’s accuracy is calculated as the percentage of statements it classifies correctly as toxic or benign. Results are presented in Figure 7 and Figure 8.
From the experiments we observe that Orca-2-7B, WizardLM-13B, LLaMA-2-70B and Orca 1 models perform better at classifying toxic statements than classifying neutral statements. This is an important note as it might cause erasure (filtering out the content related to specific groups of people even if it is not problematic) for target identity groups in ToxiGen. Orca-2-13B, WizardLM-70B and LLaMA-2-13B do not have this problem for this experiment.
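The ToxiGen classification setup and the toxic-versus-neutral comparison above can be sketched as follows. The 2.5 threshold comes from the text; applying the same threshold to the model's own score is an assumption, and the example scores are hypothetical.

```python
from typing import Dict, List

TOXIC_THRESHOLD = 2.5  # scores >= 2.5 are toxic, lower scores are benign


def to_label(score: float) -> str:
    return "toxic" if score >= TOXIC_THRESHOLD else "benign"


def per_class_accuracy(human_scores: List[float],
                       model_scores: List[float]) -> Dict[str, float]:
    """Accuracy (in %) computed separately for toxic and benign statements."""
    correct = {"toxic": 0, "benign": 0}
    total = {"toxic": 0, "benign": 0}
    for human, model in zip(human_scores, model_scores):
        gold = to_label(human)
        total[gold] += 1
        # Assumption: the model's score is thresholded the same way.
        if to_label(model) == gold:
            correct[gold] += 1
    return {label: 100.0 * correct[label] / total[label]
            for label in total if total[label] > 0}


# Hypothetical annotator and model scores for four statements.
accuracy = per_class_accuracy([1.0, 2.5, 4.0, 2.0], [1.5, 3.0, 2.0, 3.0])
print(accuracy)  # prints {'toxic': 50.0, 'benign': 50.0}
```

Reporting the two classes separately is what exposes the erasure risk: a model with high toxic accuracy but low benign accuracy is over-flagging unproblematic statements.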
Note that we also checked how well the various models follow instructions in this experiment, i.e., whether each model produces output in the requested format. All models in the Orca 2, LLaMA-2 and WizardLM families had format-compliance rates above 96%. The rate was lower for the Orca 1 model (79%), which does not follow the task instruction as well.
The performance breakdown for each of the categories in ToxiGen is presented in Appendix D.1.
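A format-compliance check of this kind could look like the sketch below. The requested output format is not specified in the text, so the pattern used here ("Score: " followed by an integer from 1 to 5) is purely a hypothetical stand-in, as are the example outputs.

```python
import re
from typing import List

# Hypothetical requested format: "Score: <integer 1-5>".
SCORE_PATTERN = re.compile(r"\s*Score:\s*[1-5]\s*")


def compliance_rate(outputs: List[str]) -> float:
    """Percentage of outputs that match the requested format exactly."""
    if not outputs:
        raise ValueError("need at least one output")
    compliant = sum(1 for out in outputs if SCORE_PATTERN.fullmatch(out))
    return 100.0 * compliant / len(outputs)


# Illustrative outputs: two follow the format, two do not.
sample_outputs = ["Score: 4", "score 4", "Score: 2\n", "I cannot rate this."]
print(compliance_rate(sample_outputs))  # prints 50.0
```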
TruthfulQA: For evaluation on this dataset we have used the multiple-choice variant of the dataset, TruthfulQA MC from EleutherAI, which includes questions from TruthfulQA in multiple choice format. Multiple choice style evaluation for TruthfulQA has also been used in [44]. There are related works that have used generative style evaluation for this dataset (e.g., [57]) using another model as judge which we have not used in this experiment.
Figure 8: ToxiGen evaluation results for neutral statement classification averaged over all the 13 categories.
The results are presented in Figure 9, where we observe that Orca-2-13B performs better in answering the questions compared to other models of similar size and comparable to models with much larger size.
Figure 9: Performance of different models on TruthfulQA benchmark. We report the accuracy as the percentage of times the model generated the correct answer to the given multiple choice questions.
Please note that the performance of both LLaMA-2-Chat-13B and LLaMA-2-Chat-70B on TruthfulQA differs from that reported in the LLaMA-2 report [57] because the evaluation schemes are different. The LLaMA-2 report uses a generative style evaluation in which GPT-3 serves as the annotator, while we use the multiple-choice version of the dataset to avoid the limitations of using GPT-3 (or any LLM) as a judge (order bias, length bias, etc.) when gold labels are available.
HHH: The results for the HHH task are presented in Figure 10, where both Orca-2-13B and Orca-2-7B outperform models of similar sizes. We also observed that the LLaMA-2-Chat-13B and LLaMA-2-Chat-70B models occasionally avoid answering HHH questions, responding with answers like: “I cannot fulfill your request.” Additionally, both LLaMA-2-Chat and Orca 2 models (w/ cautious system message) tend to produce open-ended responses to some questions (especially longer ones) instead of choosing a specific answer. We believe that additional prompt engineering could help address some of these cases.
For generative style evaluation we have used the framework proposed in [34] and ToxiGen. It is important to note that any model used as an annotator (including the ones that we have selected) is a proxy and may come with its own weaknesses and biases depending on the data it has been trained on. We recommend treating these experiments as an evaluation methodology and repeating them for the domain or application of interest, using models and annotation approaches suited to that domain:
Automated RAI Measurement Framework: We employ a recently proposed framework for automated measurement of Responsible AI metrics for LLMs [34]. In this framework, one LLM poses as a user and engages in a synthetic conversation with the LLM under test, which is Orca 2 in the experiments presented in this section. Another LLM then evaluates the responses to measure the tested model’s tendency to violate Responsible AI guidelines along the following aspects: (1) How many times does the user succeed in jailbreaking the model under test? (2) How many times does the model under test generate potentially harmful content? (3) How many times does the model leak Intellectual Property (IP)? (For example, while responding to “The Great Gatsby book full text”.) These categories are divided into multiple sub-categories described in [34]. Results of our experiments using this framework are presented in Tables 4 and 5. Please note that the automated framework sets its own system message while assessing a chat model’s safety. Please also note that Orca 2 has not gone through RLHF safety training similar to LLaMA-2-Chat-13B, and Tables 4 and 5 reflect some aspects of that. We hypothesize that, because the teacher models (ChatGPT, GPT-4) have gone through various steps of safety tuning and content filtering was used in the data generation process, Orca 2 might have learned some aspects of those through instruction tuning. We probe some of the categories where we see a larger difference (e.g., violent) and observe that Orca 2 tends to counter the harmful positions more often (which is penalized by the metric), while models that have gone through RLHF safety training tend to decline to respond more often (which is rewarded by the metric).
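To illustrate how a per-category defect rate might be tallied from the evaluator's verdicts, the sketch below assumes each synthetic conversation yields one (category, violated) pair. The category names and verdicts are hypothetical, and the actual framework in [34] may aggregate differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def defect_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Percentage of conversations per category that the evaluator flagged."""
    totals: Dict[str, int] = defaultdict(int)
    defects: Dict[str, int] = defaultdict(int)
    for category, violated in results:
        totals[category] += 1
        if violated:
            defects[category] += 1
    return {cat: 100.0 * defects[cat] / totals[cat] for cat in totals}


# Hypothetical evaluator verdicts for five synthetic conversations.
verdicts = [
    ("jailbreak", True),
    ("jailbreak", False),
    ("ip", False),
    ("ip", False),
    ("harmful_content", True),
]
rates = defect_rates(verdicts)
print(rates)  # prints {'jailbreak': 50.0, 'ip': 0.0, 'harmful_content': 100.0}
```

Note that a tally like this only counts flagged responses; as discussed above, whether a response that counters a harmful position is flagged depends on the metric's definition.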
Figure 10: Evaluation results for HHH dataset.
Table 5: Safety evaluation of Orca 2 with automated Responsible AI measurement framework [34], measured as defect rate for Jailbreak.
ToxiGen: In this experiment, we prompt the model with the test set of the ToxiGen [16] dataset, which contains toxic and benign statements. The task for the model is to continue the text further. To compute the toxicity of the generated content, we use HateBERT [4], an off-the-shelf hate speech detector, to generate a toxicity probability, with higher values signifying more toxicity in the content. We represent these probabilities as numbers between 0 and 100 for better visibility. Results for generation using toxic and neutral prompts are presented in Figure 11.
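The rescaling and per-prompt-type aggregation can be sketched as below. The detector is passed in as a callable so that no particular HateBERT API is assumed; the `stub_detector` function is a hypothetical stand-in used only to make the example runnable, and averaging per prompt type is an assumption about how the figure is built.

```python
from statistics import mean
from typing import Callable, Dict, List


def toxicity_by_prompt_type(continuations: Dict[str, List[str]],
                            toxicity_prob: Callable[[str], float]
                            ) -> Dict[str, float]:
    """Mean toxicity per prompt type, rescaled from [0, 1] to [0, 100]."""
    return {
        prompt_type: 100.0 * mean(toxicity_prob(text) for text in texts)
        for prompt_type, texts in continuations.items()
    }


# Hypothetical stand-in for a real hate speech detector.
def stub_detector(text: str) -> float:
    return 0.9 if "hostile" in text else 0.1


generated = {
    "toxic": ["a hostile continuation", "a mild continuation"],
    "neutral": ["a calm continuation", "another calm continuation"],
}
toxicity = toxicity_by_prompt_type(generated, stub_detector)
print(toxicity)
```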
Figure 12: Zero-Shot exact match performance on story reordering task of various models. GPT4 refers to default GPT-4 answer (Answer 1 in Figure 3), GPT4 w/ step-by-step corresponds to default step-by-step answer (Answer 3 in Figure 3), GPT4 w/ explain refers to answer then explain type of answer (Answer 2 in Figure 3). GPT4 w/ special step-by-step (Answer 4 in Figure 3) is used to generate data for Orca 2.
We create 5,000 training samples for story reordering using the prompt in Figure 3. We do not use the complex prompt during Orca 2 training (i.e. applying prompt erasing). We mix the task-specific data with the rest of the training dataset and evaluate Orca 2 on a distinct set of the ROCStories corpus [41]. While sampling the test instances, we remove any instances from ROCStories that are in FLAN training split to avoid contamination. Figure 12 compares the performance of Orca 2 with different system messages for GPT-4. It also captures the performance of ChatGPT, Orca 1, LLaMA and WizardLM models. This experiment highlights the potential of specializing Orca 2 models for specific tasks using synthetic data generated with prompt erasing.
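The decontamination step and the exact-match metric for story reordering can be sketched as follows. This is an illustration under two assumptions: a story is identified by the tuple of its sentences, and a prediction counts as correct only if the entire predicted order equals the gold order. All data below is hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Story = Tuple[str, ...]  # a story represented by its sentences, in order


def remove_contaminated(test_stories: List[Story],
                        train_stories: Set[Story]) -> List[Story]:
    """Drop test stories that also appear in the training split."""
    return [story for story in test_stories if story not in train_stories]


def exact_match(predictions: Sequence[Sequence[int]],
                gold: Sequence[Sequence[int]]) -> float:
    """Percentage of stories whose full predicted order equals the gold order."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and aligned")
    hits = sum(list(p) == list(g) for p, g in zip(predictions, gold))
    return 100.0 * hits / len(gold)


# Hypothetical stories and sentence orderings.
clean = remove_contaminated([("a", "b"), ("c", "d")], {("a", "b")})
print(clean)  # prints [('c', 'd')]

score = exact_match([[1, 2, 3, 4, 5], [2, 1, 3, 4, 5]],
                    [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]])
print(score)  # prints 50.0
```

Exact match is a strict metric: the second prediction above swaps only two sentences but still scores zero for that story.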
Orca 2, built upon the LLaMA 2 model family, retains many of its limitations, as well as the common limitations of other large language models and limitations originating from Orca 2’s training process, including:
This model is solely designed for research settings, and its testing has only been carried out in such environments. It should not be used in downstream applications, as additional analysis is needed to assess potential harm or bias in the proposed application.
https://learn.microsoft.com/en-us/legal/cognitive-services/openai/
Our study has demonstrated that improving the reasoning capabilities of smaller language models is not only possible, but also attainable through training on tailored synthetic data. Orca 2 models, by implementing a variety of reasoning techniques and recognizing the most effective solution strategy for each task, achieve performance levels comparable to, and often exceeding, those of much larger models, especially on zero-shot reasoning tasks. Though these models still exhibit limitations and constraints inherent to their base models, they show promising potential for future improvement, especially in terms of better reasoning capabilities, control and safety, through the use of synthetic data for post-training. While Orca 2 models have not gone through RLHF training for safety, we believe that the use of synthetic data for post-training that has been filtered with various content safety filters could provide another opportunity for improving the overall safety of the models. While the journey towards fully realizing the potential of small language models is ongoing, our work represents a step forward, especially highlighting the value of teaching smaller models to reason. It also highlights the potential of using tailored and high-quality synthetic data, created by a more powerful model, for training language models using complex prompts and potentially multiple model calls. While frontier models will continue to demonstrate superior capabilities, we believe that research toward building more capable smaller models will help pave the way for new applications that require different deployment scenarios and trade-offs between efficiency and capability.