
Model | OLMo

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-02-02

OLMo: Accelerating the Science of Language Models

  • url: https://arxiv.org/abs/2402.00838
  • pdf: https://arxiv.org/pdf/2402.00838
  • medium_post: https://blog.allenai.org/dolma-3-trillion-tokens-open-TextGenerationLLM-corpus-9a0ff4b8da64
  • official_web: https://allenai.org/olmo
  • model: https://huggingface.co/allenai/OLMo-7B
  • dataset: https://huggingface.co/datasets/allenai/dolma
  • abstract: Language models have become a critical technology to tackling a wide range of natural language processing tasks, yet many details about how the best-performing language models were developed are not reported. In particular, information about their pretraining corpora is seldom discussed: commercial language models rarely provide any information about their data; even open models rarely release datasets they are trained on, or an exact recipe to reproduce them. As a result, it is challenging to conduct certain threads of language modeling research, such as understanding how training data impacts model capabilities and shapes their limitations. To facilitate open research on language model pretraining, we release Dolma, a three trillion tokens English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. In addition, we open source our data curation toolkit to enable further experimentation and reproduction of our work. In this report, we document Dolma, including its design principles, details about its construction, and a summary of its contents. We interleave this report with analyses and experimental results from training language models on intermediate states of Dolma to share what we have learned about important data curation practices, including the role of content or quality filters, deduplication, and multi-source mixing. Dolma has been used to train OLMo, a state-of-the-art, open language model and framework designed to build and study the science of language modeling.

Contents



1. Introduction

Language models sit at the core of NLP technology, and their commercial value has grown substantially through large-scale pre-training and human annotation. However, these models are often gated behind proprietary corporate interfaces, with important details left undisclosed. So that the research community can fully understand language models and study their strengths, weaknesses, biases, and risks, the authors introduce OLMo, a fully open language model and framework. It comes with comprehensive materials including the training dataset, code, intermediate checkpoints, and training logs.


2. OLMo Framework

2.1 OLMo Model and Architecture

The transformer architecture largely follows Vaswani et al. (2017) and incorporates modifications adopted by several recent language models. The model omits bias terms and uses non-parametric layer norm to stabilize training, and it adopts the SwiGLU activation function and RoPE positional embeddings to improve performance.

Schematically, the feed-forward block applies SwiGLU to its input:

\[\text{Output} = \text{SwiGLU}(\text{Input})\]

where SwiGLU is defined as

\[\text{SwiGLU}(\mathbf{x}) = \text{Swish}(\mathbf{W}_1 \mathbf{x}) \odot (\mathbf{W}_2 \mathbf{x}), \qquad \text{Swish}(z) = z \cdot \sigma(z)\]

Here \(\sigma\) is the sigmoid function and \(\odot\) denotes element-wise multiplication; \(\mathbf{W}_1\) and \(\mathbf{W}_2\) are weight matrices applied to the input \(\mathbf{x}\). Because SwiGLU is a gated activation, its output is half the size of its input, which is part of what makes the block efficient.
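As a quick illustration, here is a minimal NumPy sketch of a SwiGLU feed-forward block; the dimensions and random weights are illustrative only (the 7B model uses d = 4096 and a hidden size of 11,008, as described later in the paper).

```python
import numpy as np

def swish(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: gate the up-projection, then project back down.

    x:      (d,)            input vector
    w_gate: (hidden, d)     gate projection (W1)
    w_up:   (hidden, d)     value projection (W2)
    w_down: (d, hidden)     output projection back to the model dimension
    """
    gated = swish(w_gate @ x) * (w_up @ x)   # element-wise product of size `hidden`
    return w_down @ gated

# Small illustrative sizes (the 7B model uses d=4096, hidden=11008).
rng = np.random.default_rng(0)
d, hidden = 512, 1408
x = rng.standard_normal(d)
out = swiglu_ffn(
    x,
    rng.standard_normal((hidden, d)) * 0.02,
    rng.standard_normal((hidden, d)) * 0.02,
    rng.standard_normal((d, hidden)) * 0.02,
)
print(out.shape)  # (512,)
```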

2.2 Pretraining Data: Dolma

Data composition and curation. Dolma is a diverse dataset of 3T tokens drawn from a variety of sources. To ensure data quality and content, it passes through several stages of filtering and deduplication before being used for model training.

2.3 Evaluation

Model evaluation is carried out with the Catwalk framework. Perplexity is evaluated on the Paloma benchmark, which covers text from a wide range of domains. Perplexity is defined as

\[\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \ldots, w_{i-1})\right)\]

where \(N\) is the number of tokens and \(p(w_i \mid w_1, \ldots, w_{i-1})\) is the conditional probability of the \(i\)-th token given its preceding context.
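For concreteness, a minimal sketch of computing perplexity from per-token log-probabilities; the token probabilities here are invented for illustration.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over N tokens."""
    n = len(token_logprobs)
    nll = -sum(token_logprobs) / n
    return math.exp(nll)

# Illustrative log-probabilities log p(w_i | w_1..w_{i-1}) for a 4-token sequence.
logprobs = [math.log(0.25), math.log(0.10), math.log(0.50), math.log(0.05)]
print(perplexity(logprobs))  # ~6.32: as "surprised" as a uniform choice over ~6 tokens
```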


3. Training OLMo

3.1 Distributed Training Framework

The OLMo models use the ZeRO optimizer strategy to reduce GPU memory consumption and handle large-scale training efficiently. Roughly speaking, sharding the model state across devices means the parameter memory held by each GPU scales as

\[\text{Per-GPU parameter memory} \approx \frac{\text{Total model parameters}}{\text{Number of GPUs}}\]
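A back-of-the-envelope sketch of this effect; the per-parameter byte counts are rough assumptions (bf16 weights plus fp32 master weights and two fp32 Adam moments), not figures from the paper.

```python
def per_gpu_param_memory_gb(total_params, num_gpus, bytes_per_param=2 + 4 + 4 + 4):
    """Rough per-GPU memory for fully sharded weights + optimizer state.

    bytes_per_param assumes bf16 weights (2B) plus fp32 master weights and two
    fp32 Adam moments (4B each); activations and gradients are ignored.
    """
    return total_params * bytes_per_param / num_gpus / 1e9

# Illustrative: a 7B-parameter model sharded over 256 GPUs vs. a single device.
print(f"{per_gpu_param_memory_gb(7e9, 256):.1f} GB per GPU")   # ~0.4 GB
print(f"{per_gpu_param_memory_gb(7e9, 1):.1f} GB unsharded")   # ~98 GB
```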

3.2 Optimizer

Gradient descent and learning-rate scheduling. The AdamW optimizer can be described by the following update rule:

\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta_t \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)\]

where \(\theta\) are the parameters, \(\eta_t\) is the learning rate, \(m_t\) and \(v_t\) are the first- and second-moment estimates, and \(\lambda\) is the decoupled weight-decay coefficient.
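As a sanity check, a minimal, framework-free sketch of one AdamW step on a scalar parameter; the hyperparameter values are illustrative defaults, not the settings reported in the paper's Table 4.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):                             # a few illustrative steps
    theta, m, v = adamw_step(theta, grad=0.5, m=m, v=v, t=t)
print(theta)
```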


4. Results and Analysis

Across multiple benchmarks, OLMo-7B performs strongly relative to competing models. Solid results on both perplexity and downstream evaluations support the effectiveness of the pretraining data and the optimization strategy.


5. Artifacts Released

  • Pretraining
    1. Training and modeling code: all training- and modeling-related code.
    2. Trained model weights: final and intermediate checkpoints (500+, every 1000 steps) for the 7B, 7B-twin-2T, and 1B models.
    3. Training metrics: the full set of metrics logged to Weights & Biases during training.
  • Data
    1. Pretraining dataset Dolma: the complete Dolma dataset.
    2. Data-order tools: tools to reconstruct the training data order and inspect which data was seen at each step.
    3. Dataset tools: tools to recreate and analyze the Dolma dataset.
  • Adaptation
    1. Adaptation training code and data.
    2. Adapted model weights: weights for OLMo+SFT and OLMo+SFT+DPO.
  • Evaluation
    1. Evaluation framework code and data: offline evaluation via the Catwalk framework.
    2. Adapted-model evaluation suite: the evaluation suite (Wang et al., 2023; Ivison et al., 2023) for adapted models.

1 Introduction

Language models have been at the center of NLP technologies for many years (Rosenfeld, 2000; Bengio et al., 2003; Mikolov et al., 2013; Peters et al., 2018; Brown et al., 2020). Recently, due to large-scale pretraining and human annotation for alignment, they have become commercially valuable (OpenAI, 2023). However, as their commercial value has increased, the largest models have become gated behind proprietary interfaces, with important details left undisclosed.

We believe that full access to open language models for the research community is critical to the scientific study of these models, their strengths and weaknesses, and their biases and risks. Accordingly, we introduce OLMo, a state-of-the-art, truly open language model and framework to build, study, and advance LMs, along with the training data, training and evaluation code, intermediate model checkpoints, and training logs.

Recent LM releases have varied in their degree of openness. For example, Mistral 8x7B provided model weights and a brief report (Jiang et al., 2024), while LLaMA came with in-depth adaptation training instructions (Touvron et al., 2023b), and Mosaic Pretrained Transformer came with many details, including the dataset distribution, though not the data itself (MosaicML NLP Team, 2023). Falcon’s pretraining data was partially released (Almazrouei et al., 2023), and the most open models—the Pythia suite (Biderman et al., 2023) and BLOOM (BigScience et al., 2022)—released training code, model checkpoints, training data and more.

OLMo releases the whole framework from data to training to evaluation tools: multiple training checkpoints across multiple hardware types, training logs, and exact datasets used, with a permissive license. We are not the only team to do this; recent work from LLM360 targets similar goals (Liu et al., 2023). OLMo narrows the gap from their models to state-of-the-art capabilities of models like LLaMA2. This project has benefited from lessons learned from all of these previous efforts with their varying degrees of openness, and we believe that a large, diverse population of open models is the best hope for scientific progress on understanding language models and engineering progress on improving their utility.

The OLMo framework encompasses the tools and resources required for building and researching language models. For training and modeling, it includes full model weights, training code, training logs, ablations, training metrics in the form of Weights & Biases logs, and inference code. This first release includes four variants of our language model at the 7B scale corresponding to different architectures, optimizers, and training hardware, and one model at the 1B scale, all trained on at least 2T tokens. We are also releasing hundreds of intermediate checkpoints available as revisions on HuggingFace. For dataset building and analysis, it includes the full training data used for these models, including code that produces the training data, from AI2’s Dolma (Soldaini et al., 2024), and WIMBD (Elazar et al., 2023) for analyzing pretraining data. For evaluation, it includes AI2’s Catwalk (Groeneveld et al., 2023) for downstream evaluation and Paloma (Magnusson et al., 2023) for perplexity-based evaluation. For instruction-tuning, we released Open Instruct (Ivison et al., 2023; Wang et al., 2023), and we are currently using it to produce an adapted (instruction-tuned and RLHFed) version of OLMo, which we will release soon. Finally, all code and weights are released under the Apache 2.0 License.

This is the first step in a long series of planned releases, continuing with larger models, instruction-tuned models, and more modalities and variants down the line. We therefore hope to catalyze research into as-yet poorly understood aspects of these models, for example, the relationship between pretraining data and model capabilities, the impact of design and hyperparameter choices, and various optimization methods and their impact on model training. In addition, we report on the lessons learned and important details necessary to successfully train language models at this scale.

2 OLMo Framework

This section describes the OLMo framework, consisting of the OLMo models (Section 2.1), our pre-training dataset, Dolma (Section 2.2), and our evaluation framework (Section 2.3).

1 http://www.apache.org/licenses/LICENSE-2.0

2.1 OLMo Model and Architecture

We adopt a decoder-only transformer architecture based on Vaswani et al. (2017), and deliver 1B and 7B variants as described in Table 1, with a 65B version coming soon. Our specific architecture includes several improvements over the vanilla transformer from Vaswani et al. (2017) following other recent large language models like PaLM (Chowdhery et al., 2022), the LLaMA family (Touvron et al., 2023a,b), OpenLM (Gururangan et al., 2023), and Falcon (Almazrouei et al., 2023). Table 2 gives a comprehensive comparison of our 7B architecture to the similarly-sized models from these other families.

Table 1: OLMo model sizes and the maximum number of tokens trained to. * At the time of writing our 65B model is still training.

We generally select hyperparameters by optimizing for training throughput on our hardware while minimizing the risk of loss spikes and slow divergence. We ablate choices through our in-loop evaluation setting, given available computational resources (Section 2.3). Table 2 compares our design choices with recent state-of-the-art open language models. Our main changes over the vanilla transformer architecture can be summarized as follows:

  1. No biases. Following LLaMA, PaLM, and others, we exclude all bias terms from our architecture in order to improve training stability.
  2. Non-parametric layer norm. We use the non-parametric formulation of layer norm (Ba et al., 2016) in which there is no affine transformation within the norm, i.e. no “adaptive gain” (or bias). We believe this was the safest option and it was also the fastest compared to the other variants we considered: parametric layer norm and RMSNorm (Zhang and Sennrich, 2019).
  3. SwiGLU activation function. Like LLaMA, PaLM, and others we use the SwiGLU activation function (Shazeer, 2020) instead of ReLU, and following LLaMA the activation hidden size is approximately 8d/3, but increased to the closest multiple of 128 (e.g. 11,008 for our 7B model) to improve throughput.2 (See the sketch after this list for how these sizes are computed.)
  4. Rotary positional embeddings (RoPE). Like LLaMA, PaLM, and others we replace absolute positional embeddings with rotary positional embeddings (RoPE; Su et al., 2021).
  5. Vocabulary. We use a modified version of the BPE-based tokenizer from GPT-NeoX-20B (Black et al., 2022) with additional tokens for masking personal identifiable information (PII). The final vocabulary size is 50,280. However, to maximize training throughput we increase the size of the corresponding embedding matrix in our model to 50,304 so that it’s a multiple of 128.
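As referenced in item 3, a small sketch of how the quoted sizes work out; rounding up to the next multiple of 128 reproduces both numbers in items 3 and 5, though whether the implementation rounds up or to the nearest multiple is an assumption on my part.

```python
import math

def round_up_to_multiple(x, multiple=128):
    """Round x up to the next multiple of `multiple`."""
    return math.ceil(x / multiple) * multiple

d_model = 4096                      # hidden dimension of the 7B model
vocab_size = 50_280                 # tokenizer vocabulary (with PII tokens)

ffn_hidden = round_up_to_multiple(8 * d_model / 3)   # SwiGLU activation hidden size
embedding_rows = round_up_to_multiple(vocab_size)    # padded embedding matrix

print(ffn_hidden)      # 11008, matching item 3
print(embedding_rows)  # 50304, matching item 5
```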

2.2 Pretraining Data: Dolma

Despite progress in access to model parameters, pretraining datasets are still not as open. Pretraining data are often not released alongside open models (let alone closed models) and documentation about such data is often lacking in detail that would be needed to reproduce or fully understand the work. This has made it difficult to support certain threads of language model research, such as understanding how training data impacts model capabilities and limitations. To facilitate open research on language model pretraining, we built and released our pretraining dataset, Dolma: a diverse, multi-source corpus of 3T tokens across 5B documents acquired from 7 different data sources that are (1) commonly seen in large-scale language model pretraining and (2) accessible to the general public (Soldaini et al., 2024). Table 3 provides a high-level overview of the amount of data from each source.

2 Since SwiGLU is a “gated” activation function, the output is half the size of the input. So technically our inputs to SwiGLU have a dimensionality of 2 × 11,008 = 22,016 for our 7B model.

[Table 2 rows: dimension, num heads, num layers, MLP ratio, layer norm type, positional embeddings, attention variant, biases, block type, activation, sequence length, batch size (instances), batch size (tokens), weight tying.]

Table 2: LM architecture comparison at the 7–8B scale. In the “layer norm type” row, “parametric” and “non-parametric” refer to the usual layer norm implementation with and without adaptive gain and bias, respectively.

Table 3: Composition of Dolma.

Dolma is built using a pipeline of (1) language filtering, (2) quality filtering, (3) content filtering, (4) deduplication, (5) multi-source mixing, and (6) tokenization. We refer the reader to the Dolma report (Soldaini et al., 2024) for more details about its design principles, details about its construction, and a more detailed summary of its contents. The report provides additional analyses and experimental results from training language models on intermediate states of Dolma to share what we learned about important data curation practices, including the role of content or quality filters, deduplication, and mixing data from multiple sources. We keep documents from each source separate, both during curation as well as in the final release. We open-sourced our high-performance data curation tools; this toolkit can be used to further experiment on Dolma, reproduce our work, and enable fast and easy curation of pretraining corpora. Finally, we also open-sourced our WIMBD tool (Elazar et al., 2023) to help with dataset analysis.

2.3 Evaluation

We perform model evaluation at two stages: online evaluation to make decisions for model design and offline evaluation to evaluate model checkpoints. For offline evaluation, we use the Catwalk framework (Groeneveld et al., 2023), our publicly available evaluation tool with access to a wide range of datasets and task formats. Using Catwalk, we perform downstream evaluation as well as intrinsic language modeling evaluation on our new perplexity benchmark, Paloma (Magnusson et al., 2023).

For both downstream and perplexity evaluation, we use our fixed evaluation pipeline to compare results against several publicly available models.

In-Loop Training Ablations Throughout model training, we perform downstream evaluations to make decisions around model architecture, initialization, optimizers, learning rate schedule, and data mixtures. We call this our online evaluation as it runs in-loop every 1000 training steps (or ∼4B training tokens) and provides an early and continuous signal on the quality of the model being trained. These evaluations rely on many of the core tasks and experiment settings used for our offline evaluation detailed in Section 4.1, which also mirrors the task and evaluation structure of the EleutherAI eval harness (Gao et al., 2023).

Downstream Evaluation Following much previous work (Brown et al., 2020; Black et al., 2022; Touvron et al., 2023a,b, inter alia), we report zero-shot performance on a set of downstream tasks. Our evaluation suite consists of 9 core tasks corresponding closely to the commonsense reasoning task set reported by Touvron et al. (2023a) and Touvron et al. (2023b) (see Table 6 for a list of tasks). Given the scale of the models being evaluated, such tasks were selected at the beginning of model development due to their naturalness (e.g., all can be formulated as text completion scoring tasks) and ability to provide meaningful signals throughout training (see Figure 1).

Intrinsic Language Modeling Evaluation To measure how OLMo-7B fits distributions of language beyond held-out training data, we use Paloma (Magnusson et al., 2023), a new perplexity benchmark that includes 585 different domains of text. Domains range from nytimes.com to r/depression on Reddit and are drawn from 18 separate data sources, such as C4 (Raffel et al., 2020), in stratified samples. This allows for more equal inclusion of text domains that are under-represented in their source corpora.

We aim not just to compare OLMo-7B against other models for best performance, but also to demonstrate how it enables fuller and more controlled scientific evaluations. OLMo-7B is the largest LM with explicit decontamination for perplexity evaluation. Following the approach described in Paloma, we remove any pretraining document with paragraphs leaked from Paloma evaluation data. Without decontamination, other models risk underestimating perplexity (i.e., overestimating the model’s out-of-sample fit). We also release intermediate checkpoints, allowing richer comparisons with two other models that release checkpoints, Pythia-6.9B (Biderman et al., 2023) and RPJ-INCITE-7B (Together Computer, 2023) (see Figure 2).

3 Training OLMo

This section describes our pretraining setup, including our distributed training framework (Section 3.1), optimizer settings (Section 3.2), data preparation (Section 3.3), and hardware (Section 3.4).

3.1 Distributed Training Framework

We train our models using the ZeRO optimizer strategy (Rajbhandari et al., 2019) via PyTorch’s FSDP framework (Zhao et al., 2023), which reduces memory consumption by sharding the model weights and their corresponding optimizer state across GPUs. At the 7B scale, this enables training with a micro-batch size of 4096 tokens per GPU on our hardware (see Section 3.4). For OLMo-1B and -7B models, we use a constant global batch size of approximately 4M tokens (2048 instances, each with a sequence length of 2048 tokens). For the OLMo-65B model (currently training), we use a batch size warmup that starts at approximately 2M tokens (1024 instances), then doubles every 100B tokens until reaching approximately 16M tokens (8192 instances).
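A sketch of the 65B batch-size warmup just described; that the doubling happens exactly at each 100B-token boundary is an assumption for illustration.

```python
def global_batch_size_tokens(tokens_seen, start=2_000_000, cap=16_000_000,
                             double_every=100_000_000_000):
    """Batch size starts at ~2M tokens, doubles every 100B tokens, and caps at ~16M."""
    size = start * (2 ** (tokens_seen // double_every))
    return min(size, cap)

for t in [0, 100e9, 200e9, 300e9, 400e9]:
    print(f"{t / 1e9:>5.0f}B tokens -> "
          f"{global_batch_size_tokens(int(t)) / 1e6:.0f}M-token batches")
# 0B -> 2M, 100B -> 4M, 200B -> 8M, 300B -> 16M (cap), 400B -> 16M
```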

Table 4: AdamW pretraining hyperparameters for OLMo models. * At the time of writing our 65B model is still training.

To improve throughput, we employ mixed-precision training (Micikevicius et al., 2017) through FSDP’s built-in settings and PyTorch’s amp module. The latter ensures that certain operations like the softmax always run in full precision to improve stability, while all other operations run in half-precision with the bfloat16 format. Under our specific settings, the sharded model weights and optimizer state local to each GPU are kept in full precision. The weights within each transformer block are only cast to bfloat16 when the full-sized parameters are materialized on each GPU during the forward and backward passes. Gradients are reduced across GPUs in full precision.
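A hedged sketch of what such a configuration can look like with PyTorch FSDP; the exact settings used for OLMo are not spelled out above, so this only mirrors the described behavior (fp32 sharded state, bf16 compute, fp32 gradient reduction), and `build_transformer` is a hypothetical helper.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Sharded weights/optimizer state stay in fp32; gathered full-sized params and
# activations run in bf16; gradients are reduced across GPUs in fp32.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,    # full-sized params materialized in bf16
    reduce_dtype=torch.float32,    # gradient reduce-scatter/all-reduce in fp32
    buffer_dtype=torch.bfloat16,
)
print(mp_policy)

# In a real distributed job (after init_process_group), one would wrap the model:
# model = FSDP(build_transformer(), mixed_precision=mp_policy)  # hypothetical builder
# and run the forward pass under autocast, which keeps numerically sensitive ops
# such as softmax in fp32 while most matmuls run in bf16:
# with torch.autocast("cuda", dtype=torch.bfloat16):
#     loss = model(batch).loss
```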

3.2 Optimizer

We use the AdamW optimizer (Loshchilov and Hutter, 2019) with the hyperparameters shown in Table 4. For all model sizes, we warm up the learning rate over 5000 steps (∼21B tokens) and then decay it linearly from there down to a tenth of the peak learning rate over the remainder of training. After the warm-up period, we clip gradients such that the total l2-norm of the parameter gradients3 does not exceed 1.0. Table 5 gives a comparison of our optimizer settings at the 7B scale to those of other recent LMs that also used AdamW.
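A minimal sketch of this schedule (linear warmup over 5000 steps, then linear decay to one tenth of the peak by the end of training); `peak_lr` and `total_steps` are placeholders rather than the values from Table 4.

```python
def lr_at_step(step, peak_lr=3e-4, warmup_steps=5000, total_steps=500_000):
    """Linear warmup to peak_lr, then linear decay to 0.1 * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * (1.0 - 0.9 * min(progress, 1.0))

for s in [0, 2500, 5000, 250_000, 500_000]:
    print(s, f"{lr_at_step(s):.2e}")   # ends at 3.00e-05, one tenth of the peak
```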

3.3 Data

We built our training dataset out of a 2T-token sample from our open dataset, Dolma (Soldaini et al., 2024), which we describe in Section 2.2. The tokens from every document are concatenated together after appending a special EOS token to the end of each document, and then we group consecutive chunks of 2048 tokens to form training instances. The training instances are shuffled in the exact same way for each training run. The data order and exact composition of each training batch can be reconstructed from the artifacts we release.
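A sketch of the concatenate-and-chunk preparation described above; the token IDs and EOS id are invented, and the real pipeline operates on the tokenized Dolma corpus.

```python
from typing import Iterable, List

EOS_ID = 0           # placeholder EOS token id
SEQ_LEN = 2048       # training sequence length

def build_instances(documents: Iterable[List[int]], seq_len: int = SEQ_LEN) -> List[List[int]]:
    """Append EOS to each tokenized document, concatenate, and cut into fixed-length chunks."""
    stream: List[int] = []
    for doc in documents:
        stream.extend(doc)
        stream.append(EOS_ID)
    # Drop the trailing partial chunk for simplicity (the real pipeline may differ).
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
print(build_instances(docs, seq_len=4))  # [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```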

All of our released models have been trained to at least 2T tokens (a single epoch over our training data), and some have been trained beyond that by starting a second epoch over the data with a different shuffling order. The impact of repeating this small amount of data should be negligible according to prior work (Muennighoff et al., 2023).

3.4 Hardware

In order to verify that our codebase could be used on both NVIDIA and AMD GPUs without any loss in performance, we trained models on two different clusters:

• LUMI: Provided by the LUMI supercomputer,4 we used up to 256 nodes on this cluster, where each node consists of 4x AMD MI250X GPUs with 128GB of memory5 and 800Gbps of interconnect.

3 During gradient clipping all of the model’s parameters are treated as a single big vector (as if all parameters were flattened and concatenated together), and we take the ℓ2-norm over the corresponding single gradient vector. This is the standard way to clip gradients in PyTorch.
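In PyTorch this corresponds to the standard clipping utility; the model and loss below are placeholders.

```python
import torch

model = torch.nn.Linear(8, 8)                      # placeholder model
loss = model(torch.randn(4, 8)).pow(2).mean()      # placeholder loss
loss.backward()

# Clip the global l2-norm of all parameter gradients (treated as one flattened vector) to 1.0.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(float(total_norm))
```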

4 https://www.lumi-supercomputer.eu

5 The MI250X is a dual-chip module, meaning in practice that each physical device consists of two logical devices, so each node has 8 logical GPU devices with 64GB of memory each.

Table 5: Comparison of pretraining optimizer settings at the 7B scale. Each model in this table used AdamW as its optimizer.

• MosaicML: Provided by MosaicML6 (Databricks), we used 27 nodes on this cluster, where each node consists of 8x NVIDIA A100 GPUs with 40GB of memory and 800Gbps interconnect.

Despite minor differences in batch size to optimize for training throughput, both runs resulted in nearly identical performance on our evaluation suite by 2T tokens.

4 Results

The checkpoint used for evaluating OLMo-7B is trained until 2.46T tokens on the Dolma (Soldaini et al., 2024) dataset with a linear learning rate decay schedule mentioned in Section 3.2. In our experiments, we find that tuning this checkpoint further on Dolma dataset for 1000 steps with the learning rate linearly decayed to 0 boosts model performance on perplexity and end-task evaluation suites described in Section 2.3. We compare OLMo with other publicly available models including LLaMA-7B (Touvron et al., 2023a), LLaMA2-7B (Touvron et al., 2023b), MPT-7B (MosaicML NLP Team, 2023), Pythia-6.9B (Biderman et al., 2023), Falcon-7B (Almazrouei et al., 2023) and RPJ-INCITE-7B (Together Computer, 2023).

4.1 Downstream evaluation

Setup Our core downstream evaluation suite (see Table 6) consists of: arc (both arc easy and arc challenge) (Clark et al., 2018), boolq (Clark et al., 2019), openbookqa (Mihaylov et al., 2018), sciq (Welbl et al., 2017), hellaswag (Zellers et al., 2019), piqa (Bisk et al., 2020), copa (Roemmele et al., 2011) and winogrande (Sakaguchi et al., 2021). In Appendix A, we also report results on an additional set of auxiliary tasks outside of our core evaluation set that we found to have less stable performance trends (see Figure 4). We note that our downstream evaluation suite is still under development and that additional results and analysis will be reported in a future version.

In all cases, we perform zero-shot evaluation using the rank classification approach popularized by Brown et al. (2020). Under this approach, candidate text completions (e.g., different multiple-choice options) are ranked by likelihood (usually normalized by some normalization factor), and prediction accuracy is reported. While Catwalk implements several common likelihood normalization strategies, including normalizing by number of tokens (per-token normalization) (Brown et al., 2020; Liang et al., 2022), by number of characters (per-character normalization) (Gao et al., 2023), as well as incorporating an answer’s unconditional likelihood (Brown et al., 2020), we selected the normalization strategies for each dataset separately. Specifically, we used unconditional normalization for arc and openbookqa, per-token normalization for hellaswag, piqa, and winogrande and no normalization for boolq, copa, and sciq (i.e., tasks formulated as single token prediction tasks).
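A sketch of rank classification under the different normalization strategies; the candidates and their log-probabilities are invented for illustration.

```python
def score(candidate, strategy):
    """Return a comparable score for one candidate completion under a normalization strategy."""
    if strategy == "none":
        return candidate["logprob"]
    if strategy == "per_token":
        return candidate["logprob"] / candidate["num_tokens"]
    if strategy == "per_char":
        return candidate["logprob"] / candidate["num_chars"]
    if strategy == "unconditional":
        # Conditional log-likelihood minus the completion's unconditional log-likelihood.
        return candidate["logprob"] - candidate["uncond_logprob"]
    raise ValueError(strategy)

# Two invented multiple-choice completions for one question.
candidates = [
    {"text": "the sun", "logprob": -4.1, "num_tokens": 2, "num_chars": 7, "uncond_logprob": -5.0},
    {"text": "a distant galaxy", "logprob": -6.0, "num_tokens": 4, "num_chars": 16, "uncond_logprob": -9.5},
]

for strategy in ["none", "per_token", "per_char", "unconditional"]:
    best = max(candidates, key=lambda c: score(c, strategy))
    print(strategy, "->", best["text"])   # different strategies can prefer different answers
```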

6 https://www.mosaicml.com

Table 6: Zero-shot evaluation of OLMo-7B and 6 other publicly available comparable model checkpoints on 9 core tasks from the downstream evaluation suite described in Section 2.3. For OLMo-7B, we report results for the 2.46T token checkpoint.

Figure 1: Accuracy score progression of OLMo-7B on the 9 core end-tasks from the Catwalk evaluation suite described in Section 2.3. We can see the benefit of decaying LR to 0 in the final 1000 steps of training on 7/9 end-tasks.

Results Table 6 summarizes the result of zero-shot evaluation of OLMo-7B and compares it against 6 other publicly available models of comparable size. We report results on 9 core tasks from our evaluation suite described in Section 2.3. Our OLMo-7B checkpoint outperforms all other publicly available models on 2 end-tasks and remains in top-3 on 8/9 end-tasks from the evaluation suite. On aggregate, OLMo-7B is competitive against all 6 publicly available model checkpoints in our comparison table.

In Figure 1 we plot the accuracy score progression of 9 core end-tasks. All tasks, except OBQA, show an upward trend in accuracy numbers as OLMo-7B is trained on more tokens. A sharp upward tick in accuracy of many tasks between the last and the second to last step shows us the benefit of linearly reducing the LR to 0 over the final 1000 training steps. See Table 8 in Appendix A for additional evaluation results and discussion.

4.2 Intrinsic language modeling evaluation

Setup For intrinsic evaluations, Paloma proposes a range of analyses, from inspection of performance in each domain separately to more summarized results over combinations of domains. We report results at two levels of granularity: the aggregate performance over 11 of the 18 sources in Paloma as in Magnusson et al. (2023), as well as more fine-grained results over each of these sources individually. This particular subset of 11 sources from Paloma excludes sources that are not publicly available, involve fringe or toxic text, or consist of code data not supported by Paloma’s decontamination approach. This leaves C4 (Raffel et al., 2020), mC4-en (Chung et al., 2023), Wikitext 103 (Merity et al., 2016), Penn Treebank (Marcus et al., 1999; Nunes, 2020), RedPajama (Together Computer, 2023), Falcon-RefinedWeb (Penedo et al., 2023), Dolma (Soldaini et al., 2024), M2D2 S2ORC (Reid et al., 2022), M2D2 Wikipedia (Reid et al., 2022), C4 100 domains (Chronopoulou et al., 2022), and Dolma 100 Subreddits (Soldaini et al., 2024). To allow for a fair comparison between models with different vocabularies, we report bits per byte as defined by Gao et al. (2020) over the test sets of these sources.
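As a reminder of how bits per byte (Gao et al., 2020) relates to the usual token-level loss, a minimal conversion sketch with illustrative numbers.

```python
import math

def bits_per_byte(total_nll_nats, num_utf8_bytes):
    """Convert a corpus-level NLL (in nats) to bits per UTF-8 byte, a vocabulary-independent metric."""
    return total_nll_nats / (math.log(2) * num_utf8_bytes)

# Illustrative: 1M tokens at an average NLL of 2.5 nats/token over a 4.3MB test set.
print(round(bits_per_byte(2.5 * 1_000_000, 4_300_000), 3))  # ~0.839
```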

Figure 2: Bits per byte on 11 evaluation data sources from Paloma and their combination (Magnusson et al., 2023), decontaminated from OLMo’s pretraining data. While models follow a general data scaling trend, sample efficiency is most favorable on in-distribution data. For example, OLMo-7B overtakes all other models on C4, perhaps from having 88.8% Common Crawl pretraining data.

Results In the Sources Combined subplot of Figure 2, we show the performance of OLMo-7B against 6 comparably-sized language models on the combination of 11 data sources from Paloma. Overall we find OLMo to have a competitive fit, especially given its training data was explicitly decontaminated against Paloma. As seen through the comparison of final models (see shapes) as well as intermediate checkpoints (see dashed lines), the OLMo results follow similar scaling trends of other models. Note that the performance of intermediate checkpoints is influenced by where that checkpoint occurs in the learning rate schedule. So models trained for fewer steps will tend to have steeper training curves without necessarily being more sample efficient if training duration were fixed across all models. MPT-7B, nevertheless, stands out as improving ahead of the other models in this subplot. This could be due to a number of factors, including pretraining data composition and its match to the domains in Paloma (e.g., MPT trains on 27% non-Common Crawl data rather than 18% for LLaMA, 12.2% for RedPajama, and 11.2% for OLMo) as well as various data preprocessing decisions (e.g., MPT’s use of semantic deduplication by Abbas et al., 2023, on C4).

The remaining subplots in Figure 2 provide more fine-grained analysis by reporting bits per byte separately for each of the 11 data sources that are combined in the aggregated Paloma metric. From this we see greater variation in sample efficiency, largely driven by the similarity of training and evaluation distributions. Notably, OLMo-7B fares well on evaluations predominated by Common Crawl, such as C4, though different ways of postprocessing Common Crawl are best fit by models trained with that specific data, such as Falcon-7B on Falcon RefinedWeb. Meanwhile, OLMo-7B is less sample efficient compared to other models on sources less related to scraped web text, such as WikiText-103, M2D2 S2ORC, and M2D2 Wikipedia. The RedPajama evaluation shows a similar pattern, perhaps as only 2 of its 7 domains are from Common Crawl, and Paloma weights domains within each source equally. Since heterogeneous data from curated sources like Wikipedia and ArXiv papers is much less abundant than scraped web text, maintaining sample efficiency for fit to these distributions of language will be challenging as pretraining corpora are scaled.

4.3 Power Consumption and Carbon Footprint

Following previous literature (Strubell et al., 2019; Patterson et al., 2021; Wu et al., 2022; Dodge et al., 2022), we estimate the total energy consumed and carbon released while pretraining our models by calculating the total power consumption required for training, and then multiplying it by the carbon emission intensity of the power grid where the model was trained. While reporting these operational emissions is standard practice, it does not account for other sources of emissions such as the embodied emissions due to the manufacturing, transportation and disposal of hardware and datacenter infrastructure, lifetime operational emissions due to use, rebound effects, or other environmental impacts such as water consumption or mining. Thus our estimates should be viewed as lower bounds.

We calculate the total power consumption for our models by measuring the power consumption of a single node every 25ms, calculating an average across the entire training run, and multiplying by the total number of nodes. We then account for the energy efficiency of the data center by multiplying the previous total by a power usage effectiveness (PUE) factor, which we set to 1.1, representing a conservative 10% energy consumption overhead typical of energy efficient datacenters.7,8 We estimate that pretraining our 7B models consumed 239 MWh of energy.

To calculate carbon emissions, we multiply the total power consumption by a carbon intensity factor, measured in kg CO2 emitted per KWh, based on the physical location of the data center where each model was trained. The model trained on A100-40GB GPUs was trained in Australia, so we assume a carbon intensity factor of 0.610, the national average for Australia in 2022.9 The model trained on MI250X GPUs was trained in the LUMI supercomputer, which runs on 100% renewable, carbon-neutral energy, so we assume a carbon intensity factor of 0. LUMI is powered entirely by hydroelectric power and some sources (Ubierna et al., 2022) measure the carbon intensity factor of hydroelectric power to be 0.024, which would imply total carbon emissions of 3.54 tCO2eq.10 However, we rely on the official LUMI data for our calculations, and thus we estimate total pretraining emissions of 69.78 tCO2eq.11 In Table 7 we compare our models with other previously released models based on publicly available information.
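A quick sketch of the two-step conversion used in this section; the node power, node count, and duration below are invented for illustration and do not reproduce the paper's totals.

```python
def total_energy_mwh(avg_node_power_kw, num_nodes, hours, pue=1.1):
    """Average node power x nodes x time, scaled by datacenter PUE, in MWh."""
    return avg_node_power_kw * num_nodes * hours * pue / 1000.0

def emissions_tco2eq(energy_mwh, intensity_kg_per_kwh):
    """tCO2eq = MWh x 1000 kWh/MWh x (kg/kWh) / 1000 kg/t = MWh x intensity."""
    return energy_mwh * intensity_kg_per_kwh

# Invented numbers for illustration only.
energy = total_energy_mwh(avg_node_power_kw=2.5, num_nodes=100, hours=400)
print(round(energy, 1))                               # 110.0 MWh
print(round(emissions_tco2eq(energy, 0.610), 1))      # ~67.1 tCO2eq on a 0.610 kg/kWh grid
print(round(emissions_tco2eq(energy, 0.024), 1))      # ~2.6 tCO2eq at hydroelectric intensity
```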

We hope that openly releasing our models can reduce future emissions by allowing others to avoid the need to pretrain models from scratch, and give insights into the true cost of developing state of the art models. We also highlight that our estimates are lower bounds, because they do not include other critical pieces of development such as debugging, hyperparameter tuning, and downtime.

7 https://www.nrel.gov/computational-science/measuring-efficiency-pue.html

8 https://www.google.com/about/datacenters/efficiency/

9 https://www.cleanenergyregulator.gov.au/Infohub/Markets/Pages/qcmr/ december-quarter-2022/Emissions-Reduction.aspx

10 https://www.lumi-supercomputer.eu

11 These metrics were in part collected using Carbonara’s AI agent and monitoring platform. Learn more at: https://trycarbonara.com

Table 7: CO2 emissions during pretraining. We estimate the total carbon emissions for various models using publicly available data on PUE, carbon intensity of local power grid, and reported power consumption. Numbers for Gopher-280B (Rae et al., 2022), BLOOM-176B (Luccioni et al., 2022), OPT-175B (Zhang et al., 2022), T5-11B (Patterson et al., 2021), LLaMA (Touvron et al., 2023a), and LLaMA2 (Touvron et al., 2023b) are taken from their respective papers. See Section 4.3 for details on how tCO2eq was calculated. * LUMI runs entirely on hydroelectric power11 and some estimates (Ubierna et al., 2022) measure the intensity factor of hydroelectric power to be 0.024, implying total emissions of 3.54 tCO2eq.

5 Artifacts Released

By sharing artifacts from all pipeline stages, we aim to encourage open research and reduce duplicated, often costly efforts, by academics and practitioners. We release the following:

  • Pretraining (Section 2.1)
    1. The training and modeling code.
    2. The trained model weights for the 7B model, 7B-twin-2T, and the 1B model. For all the models, we release not only the final model weights but also 500+ intermediate checkpoints at intervals of 1000 steps.
    3. The complete set of metrics logged to Weights & Biases during training.
  • Data (Section 2.2)
    1. Our full pretraining corpus Dolma (Soldaini et al., 2024).
    2. Tools to support reproduction of full training data order as well as inspection of which training data was seen at each step during training.
    3. Tools for recreating our training data (Soldaini et al., 2024) and performing dataset analysis (Elazar et al., 2024).
  • Adaptation (Section 2.3)
    1. The training code and data for adaptation.
    2. The model weights for OLMo+SFT and OLMo+SFT+DPO.
  • Evaluation (Section 2.4)
    1. The code and data in our evaluation framework Catwalk (Groeneveld et al., 2023) for offline evaluation on both downstream tasks and intrinsic language modeling (Magnusson et al., 2023).
    2. The evaluation suite (Wang et al., 2023; Ivison et al., 2023) for adapted models.

12 https://github.com/allenai/OLMo

13 https://huggingface.co/allenai/OLMo-7B

14 https://huggingface.co/allenai/OLMo-7B-Twin-2T

15 https://huggingface.co/allenai/OLMo-1B

16 https://huggingface.co/datasets/allenai/dolma

17 https://github.com/allenai/dolma

18 https://github.com/allenai/wimbd

19 https://github.com/allenai/OLMo-Eval

20 https://github.com/allenai/catwalk

21 https://paloma.allen.ai

22 https://github.com/allenai/open-instruct

6 License

Our goal is to facilitate scientific development and empower the scientific community, so we favor permissive licenses that give users flexibility in using our resources and artifacts. As such, all code and weights are released under the Apache 2.0 License.23 Some licenses used by other organizations for recent model releases prohibit using the outputs from their models to train artificial intelligence or machine learning systems, while we expressly allow users to do so. We also do not limit commercial use. We hope that our models can make other models better. We recognize that the risk for misuse of our models is relatively low, as language models that have not been adapted as chatbots have primarily been used as scientific artifacts not as products with broad public adoption (our models have not been adapted as chatbots). In addition, over the past year there have been a number of comparable models released with very permissive licenses, so using a more strict license for our work will not remove the overall risk in the field. We believe this tradeoff on the side of being more open is the best option.

7 Conclusion and Future Work

This technical report presents our first release of OLMo, a state-of-the-art, truly open language model and its framework to build and study the science of language modeling. Unlike most prior efforts that have only released model weights and inference code, we release OLMo and the whole framework, including training data and training and evaluation code. Soon, we will also release training logs, ablations, findings and Weights & Biases logs. We are also exploring the adaptation of OLMo with instruction tuning and different flavors of RLHF. We are going to release the adapted models as well as all of our model adaptation code and data.

We intend to continuously support and extend OLMo and its framework, and continue to push the boundaries of open LMs to empower the open research community. To that end, we look forward to bringing different model sizes, modalities, datasets, safety measures, and evaluations into the OLMo family. We hope this and future releases will empower and strengthen the open research community and inspire a new wave of innovation.

Previous: Model | Dolma* Next: Fusion | Knowledge Fusion of Large Language Models
