
Model | Dolma*

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-02-01

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

  • url: https://arxiv.org/abs/2402.00159
  • pdf: https://arxiv.org/pdf/2402.00159
  • medium_post: https://blog.allenai.org/dolma-3-trillion-tokens-open-TextGenerationLLM-corpus-9a0ff4b8da64
  • model: https://huggingface.co/datasets/allenai/dolma
  • dataset: https://huggingface.co/datasets/allenai/dolma
  • abstract: Language models have become a critical technology to tackling a wide range of natural language processing tasks, yet many details about how the best-performing language models were developed are not reported. In particular, information about their pretraining corpora is seldom discussed: commercial language models rarely provide any information about their data; even open models rarely release datasets they are trained on, or an exact recipe to reproduce them. As a result, it is challenging to conduct certain threads of language modeling research, such as understanding how training data impacts model capabilities and shapes their limitations. To facilitate open research on language model pretraining, we release Dolma, a three trillion tokens English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. In addition, we open source our data curation toolkit to enable further experimentation and reproduction of our work. In this report, we document Dolma, including its design principles, details about its construction, and a summary of its contents. We interleave this report with analyses and experimental results from training language models on intermediate states of Dolma to share what we have learned about important data curation practices, including the role of content or quality filters, deduplication, and multi-source mixing. Dolma has been used to train OLMo, a state-of-the-art, open language model and framework designed to build and study the science of language modeling.

Contents

TL;DR


  1. Optimizing large language model performance through data curation
  2. Improving model training efficiency by curating high-quality web data
  3. Validating the effectiveness of the curated data and the optimization methods through experiments

Medium Post: AI2 Dolma: 3 Trillion Token Open Corpus for Language Model Pretraining

  • The Allen Institute for AI developed and released the open language model OLMo.
  • A key project goal was to openly release the artifacts and the documentation process; the first artifact, Dolma 1, has now been released.
    • Dolma-1 is a 3-trillion-token dataset consisting of a mix of diverse web content, academic publications, code, books, and encyclopedic materials, released under the ImpACT license.

1 Introduction

Pre-training of large language models: Language models play a central role in a wide range of natural language processing tasks such as summarization and question answering. Most of the strongest language models are developed by a handful of organizations that do not disclose their development details. In particular, the composition of pretraining data is described only vaguely, which hampers scientific progress and communication with the public. Accordingly, the authors release a three-trillion-token dataset together with tools to reproduce and extend it, so that more people can participate in language model research.

The need for data transparency and open data: Data transparency helps developers and users of applications that rely on language models make more informed decisions. A higher frequency of documents or terms in the training dataset is linked to better performance on related tasks, and social biases may be present in the data. Open pretraining data makes it possible to study the relationship between data composition and model behavior, which can improve current data curation practices. Access to data is also essential for developing new language models.

Dolma, an open large-scale pretraining dataset: To support language model pretraining research, the authors propose Dolma, a corpus of three trillion tokens. Dolma gathers data from diverse sources such as web text, scientific papers, code, public-domain books, and social media posts, providing high quality and a diverse composition. Compared with the corpora behind LLaMA, GPT-3, and others, Dolma offers a larger pool of tokens, and it was used to train the OLMo model.

Dolma design principles: When designing Dolma, the authors use their evaluation suite wisely and favor decisions that advance research directions of interest. The evaluation suite developed as part of the OLMo project provides guidance across a range of capabilities and tasks during pretraining. Dolma is also designed to support future investigations into the effect of pretraining on code.


2 Dolma Design Goals

Consistency, scale, openness, and risk minimization: Dolma should stay consistent with existing language model pretraining recipes, matching data sources and processing methods so that the research community can study and scrutinize the language models being developed today. Dolma should also be able to support training large models, which, following recently proposed scaling laws, requires at least 2-3 trillion tokens. Finally, Dolma should be an openly released corpus and should be designed to avoid harm to individuals.

2.1 Openness and Consistency with Prior Work

Dolma is designed to be consistent with existing language model pretraining recipes, which allows the research community to study and critically analyze the language models being developed today. It also focuses on English text to increase the generality of a broad range of scientific work.

2.2 Supporting Large Models

Recent work suggests that models can be trained compute-optimally by maintaining a fixed ratio between language model size and the number of training tokens. Dolma provides a large enough dataset to allow further study of the relationship between model and dataset size.

2.3 Contributing an Open Corpus

Models released together with their training corpora are rare. Dolma aims to overcome this limitation by collecting data from diverse sources and documenting the process, so that other researchers can reproduce it and build new corpora.


3 Creating Dolma

Data pipeline: Creating Dolma requires a complex pipeline that turns raw data from many sources into a single collection of cleaned documents. The pipeline must acquire content from diverse sources, clean the data, and mix it into a final dataset. High-performance tooling is used to process hundreds of terabytes of text efficiently.

The Dolma dataset uses a complex pipeline to process the vast amount of text collected from diverse sources. The pipeline consists of the following stages.

  1. Language filtering: automatic language identification is used to keep only the target language.
  2. Quality filtering: a combination of Gopher and C4 heuristics keeps only high-quality data.
  3. Content filtering: toxic or inappropriate content is identified and removed.
  4. Deduplication: duplicates are removed at the URL, document, and paragraph levels to improve data efficiency.

These steps ensure that the Dolma dataset provides a strong foundation for a wide range of natural language processing tasks. This systematic approach plays a decisive role in improving the quality and diversity of language models.

Language filtering: To build Dolma's English-only corpus, automated language identification is used. fastText's language ID model removes documents with a low English score.

Quality filtering: Web-crawled data needs substantial cleanup before it can be used for pre-training. Quality filtering combines rules from C4 and Gopher.

\[\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{1:i-1})\right)\]

The formula above computes a model's perplexity, which is used as a quality signal.
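As a quick illustration of the formula above (not taken from the paper), perplexity can be computed directly from per-token log-probabilities produced by any language model; the `token_logprobs` input below is an assumed placeholder for such scores.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities log P(w_i | w_1..w_{i-1})."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)  # average negative log-likelihood
    return math.exp(avg_nll)

# Toy example with made-up log-probabilities for a three-token document.
print(perplexity([-1.2, -0.7, -2.3]))  # ~4.06
```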

Content filtering: Dolma filters out content that is harmful or contains personally identifiable information (PII). Regular expressions and fastText classifiers are used to remove toxic content and PII.

Identifying and filtering toxic content helps ensure the model is built in an ethically safer way; the filtering score is computed as follows.

\[p(y \mid x) = \frac{1}{1 + e^{-\left(w \cdot x + b\right)}}\]

Here \(x\) is the feature vector of the input text, \(w\) the weights, \(b\) the bias, and \(y\) the label indicating whether the text is toxic. \(p(y \mid x)\) is the probability that the given text is toxic; text whose score exceeds a set threshold is filtered out.

Deduplication: Removing duplicate data substantially improves training efficiency. Dolma applies the following deduplication measure.

\[D_{\text{unique}} = 1 - \frac{|S_{\text{dup}}|}{|S_{\text{total}}|}\]

Here \(D_{\text{unique}}\) is the fraction of non-duplicated data, \(S_{\text{dup}}\) the set of duplicated documents, and \(S_{\text{total}}\) the full set of documents. This formula quantifies how much duplication was removed from each data source.


4 Measuring Domain Fit and Parameter Scaling

One of the key factors in improving language model performance is domain fit together with appropriate scaling of model parameters. To this end, the Dolma project uses the following mathematical model to reason about the optimal ratio between training data and model size.

\[C = \alpha \cdot \log(T) - \beta \cdot \log(P) + \gamma\]

Here \(C\) is the model's complexity, \(T\) the number of training tokens, \(P\) the number of parameters, and \(\alpha\), \(\beta\), \(\gamma\) coefficients estimated from experiments. The expression provides guidance for balancing training tokens against model size.


5 Evaluating Pre-trained Models

Benchmark decontamination strategy: Various approaches to removing benchmark data from the pretraining data are tested; this is an important factor when evaluating model performance.

Data mixing strategy: Dolma is a multi-source dataset, so a mixing strategy is needed to decide how much data to include from each source. Various mixing strategies are evaluated to find the configuration that yields the best performance.

Evaluating the Olmo-1b model: The Olmo-1b model trained on Dolma is evaluated. Olmo-1b performs better overall than other models of a similar size. In particular, Dolma shows better perplexity efficiency than C4, RedPajama, and other corpora.

\[\text{Perplexity} = \exp \left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{1:i-1}) \right)\]

The formula above is used to evaluate Olmo-1b's perplexity on various benchmark datasets.


6 Conclusion

Dolma presents an innovative approach to optimizing large language model performance using diverse, high-quality data sources. Careful data curation substantially improves training efficiency, and the authors verify this experimentally.


1 Introduction

Language models are now central to tackling myriad natural language processing tasks, including few-shot learning, summarization, question answering and more. Increasingly, the most powerful language models are built by a few organizations who withhold most model development details (Anthropic, 2023; OpenAI, 2023; Anil et al., 2023; Gemini Team et al., 2023). In particular, the composition of language model pretraining data is often vaguely stated, even in cases where the model itself is released for public use, such as LLaMA 2 (Touvron et al., 2023b). This hinders understanding of the effects of pretraining corpus composition on model capabilities and limitations, and therefore of the models themselves, with impacts on scientific progress as well as on the public who interfaces with these models. We instead target openness and transparency, releasing and documenting a dataset of three trillion tokens alongside tools to reproduce, scrutinize and expand on our work.

Table 1: The Dolma corpus at-a-glance. It consists of three trillion tokens sampled from a diverse set of domains sourced from approximately 200 TB of raw text. It has been extensively cleaned for language model pretraining use.

Our aim is to allow for more individuals and organizations to participate in language model research and development.

  • Data transparency helps developers and users of applications that rely on language models to make more informed decisions (Gebru et al., 2021). For example, increased prevalence of documents or terms in language model pretraining data has been linked to better performance on related tasks (Razeghi et al., 2022; Kandpal et al., 2023), and social biases in pretraining data (Feng et al., 2023; Navigli et al., 2023; Seshadri et al., 2023) may necessitate additional consideration in some domains.
  • Open pretraining data is necessary for analysis via empirical studies exploring how data composition influences model behavior, allowing the modeling community to interrogate and improve current data curation practices (Longpre et al., 2023; Gao, 2021; Elazar et al., 2023). Examples of this research include memorization (Carlini et al., 2022b; Chang et al., 2023), deduplication (Lee et al., 2022), adversarial attacks (Wallace et al., 2021), benchmark contamination (Magar and Schwartz, 2022), and training data attribution (Hammoudeh and Lowd, 2022; Grosse et al., 2023).
  • Access to data is required for successful development of open language models. For example, newer language models may offer functionality such as attribution of generations to pretraining data (Borgeaud et al., 2022).

To support broader participation and inquiry in these lines of research, we present Data for Open Language Models’ Appetite (Dolma), an open corpus of three trillion tokens designed to support language model pretraining research. Pretraining data mixes are often motivated by a desire to capture so-called “general-purpose” English. We source much of our data from sources similar to those present in past work, including a mix of web text from Common Crawl, scientific research from Semantic Scholar, code from GitHub, public domain books, social media posts from Reddit, and encyclopedic materials from Wikipedia. We compare our dataset to a variety of popular pretraining corpora that are available publicly, and find that Dolma offers a larger pool of tokens at comparable quality and with equally diverse data composition. Dolma has been already used to pretrain OLMo (Groeneveld et al., 2024), a family of state-of-the-art models designed to facilitate the science of language modeling.

In summary, our contributions are two-fold:

  • We release the Dolma Corpus, a diverse, multi-source collection of 3T tokens across 5B documents acquired from 7 different data sources that are (i) commonly seen in large-scale language model pretraining and (ii) accessible to the general public. Table 1 provides a high-level overview of the amount of data from each source.
  • We open source the Dolma Toolkit, a high-performance, portable tool designed to efficiently curate large datasets for language model pre-training. Through this toolkit, practitioners can reproduce our curation effort and develop their own data curation pipelines.

The remainder of this manuscript is organized as follows: we first describe the desiderata and design principles that guided the creation of Dolma (§2). We then document the methods applied to process the raw text (§3), including filters for language, “quality,” content filtering, and deduplication. Further processing was required to prepare Dolma for use as a pretraining corpus (§4), including benchmark decontamination and selecting a mixture rate. Throughout, we conduct ablation experiments, measuring domain fit through perplexity tracking and downstream performance on a set of twelve question-answering, common sense, and reasoning tasks. We conclude by discussing the process of releasing Dolma (§5).

2 Dolma Design Goals

To support large-scale LM pretraining research, we set four design requirements around openness, consistency with prior work, size, and risk mitigation. We discuss each in turn.

Dolma’s curation should be consistent with prior language model pretraining recipes. By matching data sources and methods used to create other language modeling corpora, to the extent they are known, we enable the broader research community to use our corpus and resulting model artifacts to study (and scrutinize) language models being developed today, even those developed behind closed doors. In this reproduction effort, we follow established practices (i.e., use data sources and techniques for preprocessing and filtering content that appears frequently across language modeling efforts) to the extent they are known, and defer to analysis, experimentation and educated guesses when best practice isn’t known or implementations differ in subtle ways.1 Notably, this also means scoping Dolma to English-only text to better leverage known curation practices and maximize generalizability of scientific work on Dolma to existing language models.2 To illustrate the open-ended nature of this reproduction effort, we provide a detailed summary of known (and unknown) data curation practices for some of the largest proprietary (e.g., GPT-4 (OpenAI, 2023), PaLM 2 (Anil et al., 2023), Claude (Anthropic, 2023)) as well as open (e.g., OPT (Zhang, 2022), LLaMA (Touvron et al., 2023a), Llama 2 (Touvron et al., 2023b)) language models in Appendix §C.

Dolma should support training of large models. Hoffmann et al. (2022) suggested that one can train compute-optimal models by maintaining a fixed ratio between language model size (in parameters) and minimum number of training tokens. Recent models that follow these “scaling laws,” such as LLaMA 2 (Touvron et al., 2023b), appear to show there is still room for performance improvement by increasing the number of training tokens.3 As this is an active area of research, we aim for a sufficiently large corpus to allow further study of the relationship between model and dataset size—2-3T tokens.

1 We note this reproduction effort does not seek to replicate specific language model pretraining data implementations. Instead, we reproduce a range of data curation themes.

2 Recognizing that this focus reinforces the assumption of English as the “default” language, we hope to expand Dolma to more languages in the future. We release our data curation tools to support such efforts.

3 See Figure 5 in Touvron et al. (2023b), in which loss has not converged even at 2T tokens.

Dolma should contribute to open corpora. Lack of access to pretraining corpora alongside corresponding language models has been a major obstacle for the broader research community. Very few open models out of the hundreds released in recent years are released alongside their training data: T5 and C4 (Raffel et al., 2020), BLOOM and ROOTS (Scao et al., 2022; Piktus et al., 2023), GPT-J/GPT-NeoX/Pythia and Pile (Wang and Komatsuzaki, 2021; Black et al., 2022; Biderman et al., 2023; Gao et al., 2020), INCITE and RedPajama v1 (Together Computer, 2023b,c). However, limitations in these prior corpora have motivated the need for a new dataset such as Dolma:

  • C4 (Raffel et al., 2020), Pile (Gao et al., 2020), and Falcon (Almazrouei et al., 2023) are high-quality datasets with demonstrated use in training language models, but are unfortunately limited in scale. ROOTS (Piktus et al., 2023) is large and diverse but given its multilingual focus, its English-only portion is also too small to train English-only models.
  • RedPajama v2 (Together Computer, 2023a) meets our criteria of scale but doesn't reflect representative distributions over sources of content commonly seen in curating the largest language models (e.g., scientific papers, code).
  • RedPajama v1 (Together Computer, 2023c) is most similar to our effort and a source of inspiration when designing Dolma. While RedPajama v1 was a reproduction of the LLaMA (Touvron et al., 2023a) training data, we have a broader reproduction target which required diving into data sources that RedPajama v1 did not pursue, including larger collections of scientific papers and conversational forums like Reddit.

In all, we expand on these works by creating the largest curated open pretraining corpus to date. We define openness to mean (i) sharing the data itself, which in turn informs our choice of data sources, and (ii) documenting the process used to curate it, including decisions made with justifications, and open-source implementations to allow others to reproduce our work and create new corpora. The resulting open-source high-performance toolkit enables researchers to implement their own data pipelines to either further refine Dolma or process their own datasets.

Dolma’s curation should minimize risk of harm to individuals. Curating a pretraining corpus may introduce risk to individuals, either by facilitating access to information that is present in the corpus, or by enabling training of harmful models. To minimize these risks while meeting our stated goals, we engaged with legal and ethics experts from within our organizations early in the project and evaluated data design decisions based on their feedback on a case-by-case basis. Broadly, we follow accepted practices when available (e.g., masking of certain personal identifiable information), and take a measured approach when diverging opinions exist in the literature (e.g., most effective approach to identify and remove toxic content). Further, we provide tools to request data removal.4 As the landscape around data and AI is evolving, we do not claim that our decisions are correct. Nevertheless, we do believe in compromising on desired research artifact properties like model reproducibility, performance, and extensibility in cases of significant harm to individuals.

Even with these design goals to help scope our effort, there remain myriad decisions we must make when curating Dolma. Without a single clear recipe to follow from prior work, we rely on two principles to guide our decisions:

  • (i) Use an evaluation suite, wisely. As part of the OLMo project Groeneveld et al. (2024), we developed an evaluation suite (Groeneveld et al., 2023; details in Appendix D) to offer guidance during pretraining across a range of capabilities and tasks. Whenever possible, data decisions are made to improve its metrics. However, our evaluation suite is not perfect. For example, it cannot fully measure the effect of adding data sources that benefit models after instruction tuning5. In these cases, we make sure that any one decision does not drastically decrease performance of any of the tasks in the suite.
  • (ii) Favor decisions that advance research directions of interest to our organization. Where the above principles do not offer guidance, we seek to build a corpus that will be most useful in research at academic or non-profit organizations like those of the authors. This does not necessarily mean maximizing benchmark performance; many desirable dataset interventions are at odds with each other.

4 Available at the following URL: forms.gle/FzpUXLJhE57JLJ3f8

5 For example, the effect of adding code to pretraining data cannot be fully measured until models are able to generate executable code. However, such capability is typically observed after models are further finetuned to follow instructions (Muennighoff et al., 2023a).

3 Creating Dolma

Curation of pretraining data often requires defining complex pipelines that transform raw data from multiple sources into a single collection of cleaned, plain text documents. Such a pipeline should support acquisition of content from diverse sources (e.g., crawling, API ingestion, bulk processing), data cleanup through the use of filtering heuristics and content classifiers, and mixing into a final dataset (e.g., deduplication, up/down-sampling of sources).

In curating Dolma, we create a high-performance toolkit to facilitate efficient processing on hundreds of terabytes of text content. The toolkit is designed for high portability: it can run on any platform from consumer hardware (thus facilitating the development of new pipelines) to a distributed cluster environment (ideal for processing large datasets like Dolma). Through the curation of Dolma, we implemented commonly used cleanup and mixing steps that can be used to reproduce and curate similar datasets to Gopher, C4, and OpenWebText.

Using our toolkit, we develop and combine four kinds of data transformations that match Dolma desiderata we introduced in §2:

  • Language filtering. To create our English-only corpus, we rely on scalable tools for automated language identification. Identification is performed using fastText’s (Joulin et al., 2016a) language ID model. Depending on the length of documents in each source, we either process the entire text at once or average the score of paragraphs. Documents with a sufficiently low English score are removed.7 We do not perform any language identification on datasets that are distributed already pre-filtered to English-only documents.8 We note that language filtering is never perfect, and multilingual data is never completely removed from pretraining corpora (Blevins and Zettlemoyer, 2022).
  • Quality filtering. It is common practice to remove text that is considered “low quality,” though there is no broad consensus about what this means or how best to operationalize this with automated tools.9 For web sources, we follow recommendations in Gopher (Rae et al., 2021) and Falcon (Almazrouei et al., 2023) which suggest avoiding model-based quality filters like those used for LLaMA (Touvron et al., 2023a) and GPT-3 (Brown et al., 2020). Instead, we reimplemented and applied heuristics used in C4 (Raffel et al., 2020) and Gopher (Rae et al., 2021) that they used for processing Common Crawl. For other sources, we refer the reader to their corresponding sections as each required bespoke quality filtering strategies.
  • Content filtering. Besides removal of low-quality, unnatural content, it is standard practice to filter toxic content from pretraining data to reduce risk of toxic generation (Anil et al., 2023; Rae et al., 2021; Thoppilan et al., 2022; Hoffmann et al., 2022; Longpre et al., 2023). We follow this practice and implement a mix of rule- and classifier-based toxicity filtering techniques depending on the source.10 Large pretraining corpora have also been shown to include personal identifiable information (PII; Elazar et al., 2023), which models are able to reproduce at inference time (Carlini et al., 2022a; Chen et al., 2023b). In Dolma, we identify content for removal through a fastText classifier trained on Jigsaw Toxic Comments (cjadams et al., 2017) and a series of regular expressions targeting PII categories from Subramani et al. (2023); Elazar et al. (2023).

6 For example, we would like Dolma to support future investigations of the effect of pretraining on code; while our current evaluation suite is not properly designed to fully assess the impact of code data, we nevertheless include code in our corpus, to further research on this topic. Similarly, while previous research has suggested that removing

7 Keeping a low threshold can help mitigate inherent biases (Blodgett et al., 2016) that language detectors have against English dialects spoken by minoritized groups. Scores used for each source are reported in subsequent sections.

8 These datasets may have been filtered to English content using other classifiers and thresholds.

9 The term “quality filter,” while widely used in literature, does not appropriately describe the outcome of filtering a dataset. Quality might be perceived as a comment on the informativeness, comprehensiveness, or other characteristics valued by humans. However, the filters used in Dolma and other language modeling efforts select text according to criteria that are inherently ideological (Gururangan et al., 2022).

10 Like in the case of “quality”, there is no single definition for “toxicity”; rather, specific definitions vary depending on task (Vidgen and Derczynski, 2020) and dataset curators’ social identities (Santy et al., 2023); annotators’ beliefs also influence toxic language detection (Sap et al., 2021) Using models to identify toxic

  • Deduplication. Deduplication of pretraining corpora has been shown to be an effective technique to improve token efficiency during model training (Lee et al., 2022; Abbas et al., 2023; Tirumala et al., 2023). In preparing Dolma, we use a combination of URL, document, and paragraph-level deduplication. We achieve linear-time deduplication through the use of a Bloom filter (Bloom, 1970). We perform this deduplication across files from the same subset (e.g., deduplicate all documents in the web subset), but not across sources (e.g., do not check if any web document also appears in the code subset).

In the remainder of this section, we provide a detailed explanation of how the steps above are implemented for each data source shown in Table 1. To support our decisions, we leverage two tools. First, we inspect the output of our pipelines using the WIMBD tools (Elazar et al., 2023). This approach allows us to efficiently spot issues without having to train any models.

Then, we conduct data ablations using a 1 billion parameter decoder-only model trained up to 150 billion tokens; we provide a detailed description of our experimental setup in § D.1. Through these ablations, we can compare the outcome of our data pipelines on our evaluation suite. The evaluation suite is comprised of 18 domains on which we measure perplexity to estimate language fit (Magnusson et al., 2023; described in §D.2), as well as 7 downstream tasks on which we evaluate question answering, reasoning, and commonsense capabilities of resulting models (described in §D.3). For the remainder of this section, we present a subset of results on the evaluation suite; we include all our experimental results in Appendix K. When making decisions, we prioritize interventions that optimize metrics in downstream tasks over language fit.

3.1 Web Pipeline

Figure 1: Overview of the web processing pipeline in Dolma.

The web subset of Dolma was derived from Common Crawl.11 Common Crawl is a collection of over 250 billion pages that have been crawled since 2007. It is organized in snapshots, each corresponding to a full crawl over its seed URLs. In November 2023, there were 89 snapshots. Dolma was curated from 25 snapshots12 collected between 2020-05 and 2023-06.

3.1.1 Data Acquisition and Z Language Filtering

Following data curation practices used to develop LLaMA (Touvron et al., 2023a), our web pipeline leverages CCNet (Wenzek et al., 2020b) to perform language filtering and initial content deduplication.

11 commoncrawl.org

12 We use just enough snapshots to meet the volume goal described in §2 — at least 2T tokens.

This tool was also used for the Common Crawl subset of RedPajama v1 (Together Computer, 2023c) and RedPajama v2 (Together Computer, 2023a). CCNet processes each web page with a fastText language identification model13 to determine the primary language for each document; we keep all pages with an English document score greater than or equal to 0.5 (removing 61.7% of web pages by size). Further, CCNet identifies and removes very common paragraphs by grouping shards in each snapshot into small sets and removing duplicated paragraphs in each. This step removed approximately 70% of paragraphs, primarily consisting of headers and navigation elements. Overall, the CCNet pipeline filters out 84.2% of the content in Common Crawl, from 175.1 TB to 27.7 TB. More details are provided in Appendix J.4.
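To make the language-filtering step concrete, here is a minimal sketch (not the Dolma or CCNet implementation) of document-level English scoring with fastText's publicly released lid.176.bin language-ID model. The model path, the per-paragraph averaging, and the 0.5 threshold mirror the description above, but the exact settings vary by source.

```python
import fasttext

# Assumes the public lid.176.bin model has been downloaded locally.
model = fasttext.load_model("lid.176.bin")

def english_score(text: str) -> float:
    """Average the English probability over the non-empty paragraphs of a document."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    if not paragraphs:
        return 0.0
    scores = []
    for p in paragraphs:
        labels, probs = model.predict(p.replace("\n", " "), k=1)
        scores.append(probs[0] if labels[0] == "__label__en" else 0.0)
    return sum(scores) / len(scores)

def keep_document(text: str, threshold: float = 0.5) -> bool:
    # Documents scoring below the threshold are dropped.
    return english_score(text) >= threshold
```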

3.1.2 Z Quality Filtering

Web crawled data requires significant cleanup before it can be used for language model pretraining. This step removes artifacts introduced by the conversion from HTML to plain text (e.g., page headers, ill-formatted text) and discards pages that do not contain enough “prose-like” text (e.g., repeated text, short segments). First, CCNet natively provides a quality filter using KenLM (Heafield, 2011) perplexity to group documents into buckets based on Wikipedia-likeness; these buckets are often interpreted as high (21.9%), medium (28.5%), or low (49.6%) quality content. However, per arguments posed in Rae et al. (2021) and Almazrouei et al. (2023) against model-based quality filters, as well as our own manual inspections of content distributed between these buckets, we opted not to use these CCNet quality scores. Instead, in Dolma, we achieve quality filtering by combining heuristics introduced by Gopher (Rae et al., 2021) and C4 (Raffel et al., 2020). Specifically, we keep all the Gopher rules (henceforth, Gopher All) and keep a single heuristic from C4 designed to remove paragraphs that do not end in punctuation (C4 NoPunc; as opposed to C4 All). A detailed description of the filtering rules is provided in Appendix J.4.

Figure 2: Model ablations for quality filters of the web processing pipeline. We find that a combination of C4 and Gopher rules leads to improvements in both language fit (left, on the C4 100 Domains subset of Paloma (Magnusson et al., 2023)) and downstream performance (right, on HellaSwag Zellers et al. (2019)).

Ablation results shown in Figure 2 validate our filtering strategy: we find that C4 NoPunc on its own outperforms both C4 All as well as Gopher All on both perplexity and downstream tasks. Finally, combining Gopher All + C4 NoPunc offers the best performance. In all, the Gopher rules tagged 15.23% of UTF-8 characters for removal, while the C4 rule tagged 22.73% of characters for removal. When comparing our heuristics against CCNet’s quality scores, the remaining documents after filtering fall into CCNet buckets of high (22.8%), medium (26.2%) and low (51.0%) quality, revealing very little correlation between model and heuristic-based quality filters.

Using the tool from Elazar et al. (2023), we inspect our filtered dataset for occurrences of repeated n-grams. Despite filtering using Gopher and C4 rules, we still found undesirable texts such as repeated sequences of ‘-’ 100 times, occurring over 60 million times, or repeated sequences of ‘bla’, occurring 19.1 million times (see Table 2). Based on this, we implement n-gram heuristics to identify and remove documents containing these sequences; specifically, we remove any repeated sequence longer than 100 UTF-8 characters. While this only removed 0.003% of the total characters in the dataset, removal of these documents can prevent loss spikes during training, as was empirically found14 in Scao et al. (2022). We also note that this was a fairly conservative heuristic that left many repeated sequences remaining in the dataset; we found from manual inspection of these sequences that they often served as webpage layout elements as opposed to parsing irregularities.
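As a rough sketch of the repeated-sequence heuristic just described, the snippet below drops documents containing any short unit (such as '-' or 'bla') repeated back-to-back for more than 100 characters. The regex and the cap on the repeated unit's length are assumptions for illustration, not the Dolma toolkit's exact implementation.

```python
import re

# A unit of 1-10 characters repeated immediately at least 10 more times.
REPEAT_RE = re.compile(r"(.{1,10}?)\1{10,}", re.DOTALL)

def has_long_repetition(text: str, max_len: int = 100) -> bool:
    """True if the document contains a repeated sequence longer than max_len characters."""
    for match in REPEAT_RE.finditer(text):
        if len(match.group(0)) > max_len:
            return True
    return False

assert has_long_repetition("-" * 150)
assert has_long_repetition("bla" * 60)
assert not has_long_repetition("normal prose without repeats")
```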

13 https://fasttext.cc/docs/en/language-identification.html

Table 2: Examples of common repeated n-gram sequences in the web subset identified through WIMBD tools (Elazar et al., 2023). Repeted sequences longer than the ones shown here have been removed after being identified by WIBMD.

3.1.3 Z Content Filtering

Filtering Toxic Content Data sampled from the internet may contain harmful or toxic content (Matic et al., 2020; Luccioni and Viviano, 2021; Birhane et al., 2023a,b). As highlighted in § 2, we filter Dolma to reduce harms that might arise from training language models on toxic content. We used the Jigsaw Toxic Comments dataset (cjadams et al., 2017), which contains forum comments tagged with (multilabel) categories “toxic”, “severe toxic”, “threat”, “insult”, “obscene”, and/or “identity hate” alongside unlabeled comments, to train two fastText classifiers—a binary “hate” detector and a binary “NSFW” detector:

  1. For our “hate” detector, we group all unlabeled comments and “obscene”-only comments as negatives and left remaining comments as positives.
  2. For our “NSFW” detector, we take all comments tagged as “obscene” as positives and left other remaining comments as negatives. It is important to note this detector only filters toxic content that mentions sexual or obscene topics, not sexual content in general.

For both these models, we run them on Common Crawl sentences15 with a filtering threshold of 0.40 based on manual threshold tuning. We chose our threshold seeking a balance between (1) maximizing precision and recall from inspecting predicted toxic sentences on a single snapshot of Common Crawl, as well as (2) minimizing too much data removal.16 We always remove just the span that has been tagged as toxic, not the full document. We make both of these models available publicly.17 In Figure 3, we compare the effect of two different thresholds for the “hate” and “NSFW” detector. The “High Threshold” configurations remove less content, but generally yield higher perplexity on the evaluation set and lower downstream performance. The “Low Threshold” configurations remove more content and generally have higher performance, but remove more units of text (7.3% vs 34.9% and 5.5% vs 29.1%, for “hate” and “NSFW” UTF-8 characters, respectively). Because lower thresholds might lead to false positives, and improved performance can be achieved by combining content filters with quality and deduplication filters, we use the “High Threshold” versions of the “hate” and “NSFW” filters, removing any sentence with a score greater than or equal to 0.4.
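A minimal sketch of this sentence-level filtering is shown below. The model file names, the `__label__positive` label scheme, and the crude regex sentence splitter are assumptions for illustration; the actual pipeline uses the released fastText taggers and the BlingFire splitter.

```python
import re
import fasttext

hate_model = fasttext.load_model("jigsaw_hate.bin")   # hypothetical path
nsfw_model = fasttext.load_model("jigsaw_nsfw.bin")   # hypothetical path

def split_sentences(text: str):
    # Dolma uses the BlingFire splitter; a crude regex stands in here.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def toxic_score(model, sentence: str) -> float:
    """Probability of the positive (toxic/NSFW) class for a binary fastText classifier."""
    labels, probs = model.predict(sentence.replace("\n", " "), k=1)
    return probs[0] if labels[0] == "__label__positive" else 1.0 - probs[0]

def filter_toxic_spans(text: str, threshold: float = 0.40) -> str:
    # Keep only sentences scoring below the threshold on both detectors;
    # the rest of the document is left intact.
    kept = [
        s for s in split_sentences(text)
        if toxic_score(hate_model, s) < threshold
        and toxic_score(nsfw_model, s) < threshold
    ]
    return " ".join(kept)
```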

Filtering Personal Identifiable Information Data sampled from the internet can also leak personal identifiable information (PII) of users (Luccioni and Viviano, 2021; Subramani et al., 2023); such PII is abundant in large-scale datasets (Elazar et al., 2023).

14 More information at <github.com/bigscience-workshop/bigscience/blob/master/train/tr8-104B-wide/chronicles.md>

15 Identified using BlingFire sentence splitter (Microsoft, 2019).

16 For example, the “hate” and “NSFW” detectors filter out 34.9% and 29.1% of tokens from Common Crawl at thresholds of 0.0004 and 0.00017, respectively.

17 “NSFW” fastText tagger and “hate” fastText tagger.

Figure 3: Model ablations for toxic content filters of the web processing pipeline. We find that adopting a “Low Threshold” for the “hate” and “NSFW” toxic content filters results in improvements in both language fit (left, on the C4 100 Domains subset of Paloma (Magnusson et al., 2023)) and downstream performance (right, on HellaSwag Zellers et al. (2019)); however, more content is removed (7.3% vs 34.9% and 5.5% vs 29.1%, for “hate” and “NSFW” UTF-8 characters, respectively).

PII detection can be accomplished using model-based tools (Dernoncourt et al., 2017; Microsoft, 2018; Hathurusinghe et al., 2021; Lison et al., 2021; Lukas et al., 2023; Mazzarino et al., 2023) or rule-based approaches (Aura et al., 2006; Elazar et al., 2023). The former generally offer better performance, while the latter are faster.

The size of Dolma makes it impractical to use model-based tools; instead, we rely on carefully crafted regular expressions. Following the findings of Subramani et al. (2023), we tag three kinds of PII that can be detected with sufficient accuracy: email addresses18, IP addresses19, and phone numbers20. Once spans are tagged, we employ different processing strategies based on their density in each document:

  • 5 or fewer PII spans detected: we replace all spans on a page with special tokens |||EMAIL_ADDRESS|||, |||PHONE_NUMBER|||, and |||IP_ADDRESS||| for email addresses, phone numbers, and IP addresses respectively21. In total, we find 0.02% of documents in the 25 Common Crawl snapshots match this filter.
  • 6 or more PII spans detected: we remove any document that contains 6 or more matching PII spans. We use this approach because pages containing abundant phone numbers and email addresses are likely to pose a greater risk of disclosing other PII classes; a sketch of the overall masking/removal policy is shown after this list. 0.001% of documents in the 25 Common Crawl snapshots match this filter.
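The sketch below illustrates the policy from the list above: mask spans when a document has 5 or fewer matches, drop the document entirely at 6 or more. The regular expressions are simplified stand-ins, not the exact patterns reproduced in the footnotes below.

```python
import re
from typing import Optional

PII_PATTERNS = {
    "|||EMAIL_ADDRESS|||": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "|||PHONE_NUMBER|||": re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}"),
    "|||IP_ADDRESS|||": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def apply_pii_policy(text: str, max_spans: int = 5) -> Optional[str]:
    """Return the masked document, or None if it should be removed entirely."""
    total = sum(len(p.findall(text)) for p in PII_PATTERNS.values())
    if total == 0:
        return text
    if total > max_spans:
        return None  # 6 or more spans: drop the whole document
    for token, pattern in PII_PATTERNS.items():
        text = pattern.sub(token, text)
    return text
```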

In Figure 4, we show results of an experiment designed to quantify the impact of our PII strategy. Overall, we find that, in both language modeling and downstream tasks, PII removal and masking has no discernible effect on model performance.

3.1.4 Deduplication

Recent efforts indicate that the deduplication of data leads to language models that train more efficiently (Lee et al., 2022). Following this principle, we deduplicate data in the web pipeline. We perform three stages of deduplication:

  • (i) Exact URL deduplication: mark pages that share the same URL. No normalization is performed. This filter is primarily intended to remove pages that have been crawled multiple times. Overall, it removes 53.2% of documents in the 25 snapshots used to create Dolma. URL deduplication is commonly used as the first stage for web crawls thanks to its computational efficiency (Agarwal et al., 2009; Koppula et al., 2010; Penedo et al., 2023).
  • (ii) Exact document deduplication: mark pages that contain the same text. No punctuation or whitespace is removed. Empty documents count as duplicates. Overall, it removes an additional 14.9% of documents after URL deduplication.
  • (iii) Exact paragraph deduplication: mark identical paragraphs across pages as duplicates. We keep the definition of this unit consistent with previous filters: a paragraph is a span of text separated by the newline UTF-8 character “\n”. Overall, this filter tags 18.7% of documents in the URL-deduplicated set as repeated.
18 Regex: [.\s@,?!;:)(]*([\^\s@]+@[\^\s@,?!;:)(]+?)[.\s@,?!;:)(]?[\s\n\r]

19 Regex: \s+\(?\)d{3})\(?[-\. ]*\)d{3})[-. ]?$$d{4})

20 Regex: (?:(?:25[0-5] 2[0-4][0-9] [01]?[0-9]{1,2}).){3}

21 When training models on Dolma, we add these special tokens to the tokenizer vocabulary. For all results shown in this paper, we use allenai/gpt-neox-olmo-dolma-v1

Figure 4: 1B model ablations for PII strategies. We found no discernible differences between removing all documents with PIIs, only removing documents with ≥ 5 PII instances and masking the rest, and doing no PII filtering at all.

This multi-stage approach is designed to increase efficiency: stages (i) and (ii) are designed to remove copies of the same item (identical pages might have multiple URLs, such as in the case of the same news article being included in multiple online newspapers), thus can be executed before any content or quality filtering, reducing the number of pages to process. In contrast, stage (iii) removes repeated content that appears on different pages (such as the same byline appearing under all articles written by the same author), thus altering portions of pages and potentially disrupting content analysis. All stages use a Bloom filter (Bloom, 1970) data structure for efficient content deduplication.
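For concreteness, here is a self-contained sketch of paragraph-level deduplication with a Bloom filter, in the spirit of the stages above. The filter size, number of hash functions, and use of blake2b are illustrative choices, not the Dolma toolkit's actual implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 24, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions per item by salting the hash.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode(), salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def dedup_paragraphs(documents, bloom=None):
    """Drop any paragraph already seen earlier in the stream of documents."""
    bloom = bloom or BloomFilter()
    for doc in documents:
        kept = []
        for paragraph in doc.split("\n"):
            if paragraph and paragraph in bloom:
                continue  # duplicate paragraph: remove it from this page
            bloom.add(paragraph)
            kept.append(paragraph)
        yield "\n".join(kept)
```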

3.1.5 Putting It All Together

How do steps in the pipeline compose? To summarize, the Dolma web pipeline transforms the output of CCNet by first performing URL and document-level deduplication, followed by quality filtering (Gopher, C4 NoPunc), content filtering (toxic content, PII), and, finally, paragraph-level deduplication. But what is the combined outcome of the filtering?

Figure 5: Compounding effect of quality filtering, content filtering, and paragraph-level deduplication on 1B model ablations. Combination of all components in the pipeline leads to improvements in both language fit (left, on the C4 100 Domains subset of Paloma (Magnusson et al., 2023)) and downstream performance (right, on HellaSwag Zellers et al. (2019)).

In Figure 5, we show the compounding effect of the stages of the pipeline. We find that the combination of the three stages achieves the best performance on downstream tasks, while content filtering slightly hurts language fit on the C4 100 Domains subset. As stated in §2, we leverage downstream evaluation tasks to make decisions; thus we use all steps in the pipeline when creating Dolma.

Data distribution We use the tool from Elazar et al. (2023) to inspect the final data composition in Figure 6. In particular, we analyze web domain, year, and language distributions.

  • (a) Web (URL) domains
  • (b) Dates of documents
  • (c) Non-English languages

Figure 6: Frequencies over different document metadata as computed using the What’s In My Big Data? tool from Elazar et al. (2023). In subfigure (c), un denotes documents whose language could not be identified; long indicates documents that are too long to be processed with the tool’s language ID module.

We note that Dolma contains documents from a broad set of internet domains, mostly from 2020, 2022, and 2021. The most common internet domains in Dolma, per token, are patents.google.com, followed by www.nature.com and www.frontiersin.org. In fact, similar to other corpora reported in Elazar et al. (2023), 63.6% of Dolma’s web documents are from ‘.com’ sites (followed then by ‘.org’ and ‘.co.uk’ sites). Finally, as all language identification tools are imperfect, we summarize what languages are remaining post English-only filtering: We find the most common language after English is not well identified (‘un’) with 0.86% of the documents, followed by 0.06% of the documents identified as Chinese.

Do quality and content filters have similar effects? In order to further understand how filters described in § 3.1.2 and § 3.1.3 interact with each other, we perform a correlation analysis on a subset of documents sampled from our pipeline.

  • (a) Head
  • (b) Middle
  • (c) Tail

Figure 7: Pearson Correlation of filters on the Head, Middle, and Tail parts of our Common Crawl data. The correlation is computed for 24M, 20M, and 43M documents respectively. The filters are Gopher=Gopher rules from Rae et al. (2021), Dedup.=Deduplication, PII=Personal Identifiable Information, Hate=Hate Speech and Decont.=Decontamination.

The correlation among the documents flagged for removal by our Common Crawl filters is depicted in Figure 7. We find that correlations are generally low, thus our filters select fairly different documents and are not redundant. There is some positive correlation between our PII (Personal Identifiable Information) filters and filters removing hate speech. This is likely because hate speech is often directed at people. The Gopher filtering rules correlate negatively with our deduplication, especially for the high-perplexity tail part of our data. This is due to the Gopher rules removing many high-perplexity documents such as random strings, which are not caught by deduplication due to their randomness. As these random strings likely do not contribute to a better understanding of language, it is important to filter them out and thus rely on filters beyond deduplication.

3.2 Code Pipeline

Figure 8: Overview of the data pipeline to process code documents.

3.2.1 Data Acquisition and Z Language Filtering

We derive the code subset of Dolma from The Stack (Kocetkov et al., 2022), a collection of permissively-licensed GitHub repositories. We use the near-deduplicated version as a starting point, thus removing the need to perform deduplication ourselves. The raw version of this dataset was collected in March 2023. We filter data-heavy documents by removing files with extensions such as JSON and CSV.

3.2.2 Z Quality Filtering

We apply heuristics derived from the RedPajama v1 (Together Computer, 2023c) and StarCoder (Li et al., 2023) datasets. The former consists of rules to remove repetitive file preambles, such as license statements22, and documents with excessively long lines or mostly numerical content. Overall, the RedPajama Rules (RPJ) are designed to remove files that are mostly data or generated through templates. To further select high quality code snippets, we leverage rules from the StarCoder pipeline; these heuristics filter GitHub repositories with no to few stars, files with too few or too many comments, and HTML files with low code-to-text ratio. For a detailed description of these rules, see §J.4.
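The sketch below combines RPJ-style and StarCoder-style rules in a single filter. The specific thresholds are illustrative assumptions; the exact rules and values used for Dolma are documented in Appendix J.4 of the paper, and `comment_ratio` is assumed to be precomputed per file.

```python
def keep_code_file(text: str, repo_stars: int, comment_ratio: float) -> bool:
    """Heuristic quality filter for a single source file (illustrative thresholds)."""
    lines = text.splitlines() or [""]
    max_line_len = max(len(line) for line in lines)
    avg_line_len = sum(len(line) for line in lines) / len(lines)
    alpha_frac = sum(c.isalpha() for c in text) / max(len(text), 1)

    if max_line_len > 1000 or avg_line_len > 100:
        return False            # likely minified, generated, or data-heavy files
    if alpha_frac < 0.25:
        return False            # mostly numerical or binary-like content
    if repo_stars < 1:
        return False            # StarCoder-style repository popularity rule
    if not (0.01 <= comment_ratio <= 0.8):
        return False            # too few or too many comments
    return True
```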

Figure 9: Comparison of quality filtering when using RedPajama Rules (RPJ) rules or RPJ and StarCoder rules combined. Combining the two rulesets results in slightly improved perplexity on code documents (left, HumanEval; Chen et al., 2021b ), more stable perplexity curves on non-code test sets (center, on the C4 100 Domains subset of Paloma; Magnusson et al., 2023), and slightly improved downstream performance (right, on HellaSwag; Zellers et al., 2019).

In Figure 9, we present a comparison between RedPajama (RPJ) and StarCoder rules. In our ablations we find that, compared to RPJ rules alone, RPJ and StarCoder combined lead to lower perplexity on code datasets (e.g., HumanEval; Chen et al., 2021b), more stable perplexity during training on non-code test sets (e.g., C4 100 Domains subset of Paloma; Magnusson et al., 2023), and improved downstream performance (e.g., HellaSwag; Zellers et al., 2019). Therefore, we chose to use this combination when creating the final mix for Dolma.

22 We keep this information in the metadata associated with each document in Dolma.

3.2.3 Z Content Filtering

We apply the same filtering rules from the web pipeline (§ 3.1) to mask personal identifiable information (PII). Documents with greater than 5 PII instances are removed from Dolma. In all other instances, emails, phone numbers, and IP addresses are masked using special tokens.

We also remove code secrets or personal information. To do so, we use the detect-secrets (Yelp, 2013) library and remove any documents with a match.

3.2.4 Deduplication

We used the already-deduplicated version of The Stack published by Kocetkov et al. (2022); their approach uses the pipeline first introduced by Allal et al. (2023), which uses MinHash (Broder, 2002) and Locality-Sensitive Hashing to find similar documents.

3.3 Conversational Forums Pipeline

Figure 10: Overview of the data pipeline to process conversational forums.

3.3.1 Data Acquisition and Z Language Filtering

The conversational subset of Dolma was derived from the Pushshift Reddit dataset (Baumgartner et al., 2020b), a large collection of forum conversations collected through Reddit’s data API and distributed by the Pushshift project. We derive the conversational subset in Dolma from 378M posts from Reddit, from December 2005 until March 2023. We include both submissions—initial message in conversations on Reddit—and comments—replies to messages—in the dataset. We treat all submissions and comments as independent documents without any structure or connection to the thread they appear in; in our evaluation, this simplified representation yields better performance on downstream tasks. A discussion of this trade-off is presented in Appendix E.

For consistency, we use the same strategy as the web pipeline to filter non-English content. In particular, we keep submissions and comments with an English score greater than 0.5.

3.3.2 Z Quality Filtering

Conversational forum data must be adequately cleaned to remove content that is too short, repetitive, or is negatively ranked by the community it was submitted to. We use the pipeline introduced by Henderson et al. (2019) to facilitate cleanup of submissions and comments using Google Dataflow23. We remove comments shorter than 500 characters, and submissions shorter than 400 characters24. We also remove documents over 40,000 characters in length.

23 https://cloud.google.com/dataflow

24 Qualitative inspection of the data suggested that submissions are of higher quality than comments; thus, we use a more permissive minimum length.

We remove comments with fewer than 3 votes25, as lower scores are associated with comments that are deeply nested in a conversational thread (Weninger et al., 2013) or content that is more likely to result in emotionally charged discourse (Davis and Graham, 2021). Votes have been used as a signal in constructing the WebText (Radford et al., 2019) and OpenWebText (Peterson, 2020) corpora. We discard documents that have been deleted by their authors or removed by moderators; further, documents that have been labeled by their authors as “over 18” were also removed. We exclude any document originating from any of the 26,123 banned and not safe for work subreddits26 we curated.
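A sketch of these conversational-forum rules applied to a Pushshift-style record is shown below. The field names ("body", "selftext", "score", "over_18", and so on) follow common Pushshift conventions and are assumptions here, as is the placeholder blocklist.

```python
BANNED_SUBREDDITS = {"example_banned_subreddit"}  # stand-in for the curated blocklist

def keep_reddit_document(record: dict, is_submission: bool) -> bool:
    """Apply the length, score, moderation, and subreddit rules described above."""
    text = (record.get("selftext") if is_submission else record.get("body")) or ""
    min_len = 400 if is_submission else 500
    if not (min_len <= len(text) <= 40_000):
        return False                      # too short or over 40,000 characters
    if not is_submission and record.get("score", 0) < 3:
        return False                      # comments with fewer than 3 net votes
    if record.get("removed") or record.get("deleted") or record.get("over_18"):
        return False                      # moderated, deleted, or "over 18" content
    if record.get("subreddit", "").lower() in BANNED_SUBREDDITS:
        return False                      # banned / NSFW subreddit blocklist
    return True
```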

3.3.3 Z Content Filtering

We apply the same filtering rules used in the web pipeline (§ 3.1.3) to remove toxic content and mask PII. Unlike in the case of the web pipeline, we fully remove a document if any part of it is tagged as toxic. We employ this strategy because content from Reddit is shorter in length, thus it is more likely that a single sentence classified as toxic is a strong indication of the entire document being toxic as well.

3.3.4 Deduplication

We employ the same strategy used in the web pipeline (§3.1.4). Since submissions and comments are shorter than web documents, we only deduplicate at a document-level. This strategy is useful to reduce the incidence of “Copy pasta” (blocks of text that get often repeated across many comments and subreddits for comedic effect) and other repetitive information.

3.4 Other Data Sources

In this section, we briefly summarize additional high-quality sources that were used to derive Dolma. For more details on collection and processing, see Appendix § J.3 and §J.4.

C4 for Curated Web Content Similarly to LLaMA (Touvron et al., 2023a), we include documents from C4 (Raffel et al., 2020) in the Dolma dataset. We further refine this data by reprocessing it through our web pipeline to remove long, repeated sequences (§ 3.1.2) and duplicates (§3.1.4). Finally, we also perform PII masking as described in §3.1.3.

PeS2o for Academic Literature The PeS2o dataset (Soldaini and Lo, 2023) is a collection of approximately 40 million open-access academic papers that have been cleaned, filtered, and formatted for pre-training of language models. It is derived from the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al., 2020). As this dataset has been created for language modeling purposes, we use it as-is.

Project Gutenberg for Books Project Gutenberg is a repository of over 70 thousand public domain books. We collected Project Gutenberg’s archive in April 2023. We use the same fastText-based language identification model to identify English language books and include them in Dolma. More details in our Data Sheet § J.

Wikipedia and Wikibooks for Encyclopedic Content This dataset was derived from the March 2023 Wikimedia dumps. We use the “English” and “Simple” editions of Wikipedia and Wikibooks as the base for the encyclopedic subset of Dolma. Sources were processed using WikiExtractor27. We remove any document with 25 or fewer UTF-8-segmented words, as we found shorter pages to either be the result of short, templated pages (e.g., pages containing only a few words and an information box) or XML parsing errors.
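A minimal sketch of this length filter is shown below, assuming pages have already been extracted to plain text (e.g., with WikiExtractor); whitespace splitting stands in for the UTF-8 word segmentation mentioned above.

```python
def keep_wiki_page(text: str, min_words: int = 26) -> bool:
    """Drop templated or mis-parsed pages with 25 or fewer words."""
    return len(text.split()) >= min_words

assert not keep_wiki_page("Stub page with an infobox only")
assert keep_wiki_page("word " * 30)
```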

25 The total votes for each document are obtained by computing the difference between positive votes (“upvotes”) and negative votes (“downvotes”).

26 The list is available at https://github.com/allenai/dolma/blob/main/sources/reddit/atomic_content_v5/subreddit_blocklist.txt. The list was obtained by merging several sources that tracked banned subreddits (mostly from posts on Reddit itself). We also measured the fraction of posts within a subreddit tagged as NSFW, and blocked the subreddit when this fraction exceeded 10%.

27 <github.com/attardi/wikiextractor>, v.3.0.7, commit prefix 8f1b434.

4 Training a Language Model on Dolma

As a final validation step of the Dolma pipeline, we train, evaluate and release a decoder-only, autoregressive language model which we call Olmo-1b. In this section, we discuss additional dataset curation decisions specific to model training. In §4.1, we present an approach to remove benchmark tasks—i.e., decontaminate—from Dolma. Then, in § 4.2, we discuss considerations when combining—i.e., mixing—the various document subsets in Dolma to obtain the final pretraining corpus. Finally, in § 4.3, we present experimental results of the resulting Olmo-1b model. Olmo-1b uses the GPT-NeoX tokenizer (Black et al., 2022), which we found to be well suited for Dolma; we present results supporting our decision in Appendix F.

4.1 Strategies for Benchmark Decontamination in Dolma

In this section we experiment with approaches to remove benchmark contamination from pretraining data and select the one ultimately used for Olmo-1b. Large-scale language datasets contain copies of benchmarks that are commonly used to evaluate language models (Dodge et al., 2021; Yang et al., 2023; Elazar et al., 2023). The impact of such contamination is currently debated. For example, Lee et al. (2022) showed that removing duplicates of validation data from C4 pretraining increases perplexity on the previously duplicated validation data. Meanwhile, work examining post-hoc performance difference between contaminated and uncontaminated downstream data finds no consistent positive or negative impact (Chowdhery et al., 2022; Brown et al., 2020; OpenAI, 2023). To start, we focus on the removal of perplexity benchmark contamination, and we measure the extent of downstream task contamination. We experiment with removing contamination with respect to an early version of Paloma (Magnusson et al., 2023), a benchmark of 585 text domains designed to evaluate language model fit to diverse sources. This selection of perplexity evaluations is detailed in Appendix D.

Decontamination strategy for perplexity evaluation Using the paragraph deduplication tools described in § 3.1.4, we mark any paragraph in Dolma as contaminated if (i) it is longer than 13 Unicode-segmented tokens28 and (ii) it appears in any of the documents in Paloma. In preliminary experiments on decontaminating C4 (Raffel et al., 2020) against an early version of Paloma, we compare the paragraph-based decontamination technique described above with exact-matching whole documents. Results show that document-based decontamination yields a lower matching rate, with only 1 of 12 subsets with greater than 1% contaminated documents29. However, when considering paragraph-based decontamination, 6 of 12 perplexity tasks have greater than 1% of documents contaminated. Since the latter better reflects expected contamination rates, we chose it for the remainder of this section.

Lastly, we consider two ways of removing contamination. In preliminary experiments on C4, we find that removing just the contaminated paragraphs by excluding them from documents removes 0.01% of tokens, while removing whole documents with any contamination removes 0.02% of tokens. In either case 0.01% of documents are affected. Given that each have relatively small impact, we opt for removing full documents to avoid disrupting reading order, though this does bias towards removing longer documents.
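The sketch below illustrates this paragraph-based decontamination with full-document removal. Hashing the paragraphs is an implementation choice made here for compactness, and whitespace tokenization approximates the Unicode token segmentation described above.

```python
import hashlib

MIN_TOKENS = 13  # paragraphs at or below this length are ignored to avoid false positives

def _norm_hash(paragraph: str) -> str:
    # Hash a whitespace-normalized paragraph so trivial spacing differences still match.
    return hashlib.sha256(" ".join(paragraph.split()).encode()).hexdigest()

def build_eval_index(eval_documents):
    """Collect hashes of all sufficiently long paragraphs in the evaluation data."""
    index = set()
    for doc in eval_documents:
        for paragraph in doc.split("\n"):
            if len(paragraph.split()) > MIN_TOKENS:
                index.add(_norm_hash(paragraph))
    return index

def is_contaminated(train_document: str, eval_index: set) -> bool:
    """True if any long paragraph of the training document appears in the eval index."""
    return any(
        _norm_hash(p) in eval_index
        for p in train_document.split("\n")
        if len(p.split()) > MIN_TOKENS
    )

# Training documents flagged by is_contaminated are dropped in full, per the strategy above.
```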

Decontamination results for perplexity evaluation To assess the risk of our decontamination approach, we train30 two 1B parameter models on a 221B token subset of RedPajama v1 (Together Computer, 2023c), the corpus most similar to Dolma’s intended composition at the time of experimenting. The first model is trained on RedPajama v1 as-is, while the second uses the same corpus after the paragraph-matching, document-removal decontamination approach described above. On this subset, our decontamination approach removes 2.17% of unicode tokens and 0.66% of documents. In Table 3 we show that differences in perplexity and downstream task performance are minimal and do not trend consistently positive or negative. For perplexity, 7 sources degrade and 6 improve; for downstream tasks, 5 degrade and 4 improve. The largest degradation in a perplexity source is 22.0 to 22.3 on Penn Tree Bank.

28 Like in Elazar et al. (2023), we only consider paragraphs of sufficient length to avoid false positive matches.

29 C4 100 Domains subset, which is directly constructed from C4.

30 This experiment uses the setup described in Appendix D, including model configuration, optimizer, and evaluation setup.

Table 3: Performance differences with and without our decontamination approach on 1B models trained on RedPajama v1 (Together Computer, 2023c). Perplexity (ppl) results are from Paloma and downstream (end task) results are from the tasks listed in Appendix D plus COPA (Gordon et al., 2012). We find no evidence that decontamination degrades overall model performance.

The largest degradation in a downstream task is a drop of 1.5% accuracy on SCIQ to 84.8%. In conclusion, results show no consistent evidence of performance degradation with decontamination.

Decontamination in Olmo-1b. As our experiments have derisked our approach for removing benchmark contamination, we apply it to our model trained on Dolma. The finalized approach for removing overlap with Paloma is detailed in Magnusson et al. (2023). It applies the steps discussed in this section with the addition of a filter that ignores overlaps consisting of only punctuation, spaces, and emoji. These types of tokens can be arbitrarily repeated in text formatting, leading to common n-grams greater than our 13-gram threshold. On the final Dolma corpus used to train Olmo-1b, our approach finds less than 0.001% characters in training data contaminated, and removes fewer than 0.02% of documents.

Measuring possible contamination of downstream tasks. We also measure data contamination in Dolma directly. We follow the setup from WIMBD (Elazar et al., 2023) and compute the percentage of instances from tasks with two or more inputs (e.g., natural language inference) whose inputs can all be found in a single document. This serves as an upper bound on exact-match contamination in Dolma. We consider 82 datasets from PromptSource (Bach et al., 2022) and report the datasets for which at least 5% of the test set can be found in Dolma. We report the results in Figure 11.
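A sketch of this WIMBD-style upper bound is given below; the data shapes (a list of field dictionaries per task and a list of Dolma documents) are hypothetical stand-ins, and a real run would use an index over the corpus rather than scanning documents directly.

```python
# Sketch of exact-match contamination for multi-input tasks: an instance counts as
# contaminated if all of its string fields occur verbatim in the same document.
def contamination_rate(test_instances: list[dict], documents: list[str]) -> float:
    contaminated = 0
    for instance in test_instances:
        fields = [v for v in instance.values() if isinstance(v, str) and v.strip()]
        if fields and any(all(f in doc for f in fields) for doc in documents):
            contaminated += 1
    return 100.0 * contaminated / max(len(test_instances), 1)


# Report datasets whose test sets are at least 5% contaminated:
# rates = {name: contamination_rate(instances, dolma_docs)
#          for name, instances in promptsource_tasks.items()}
# flagged = {name: r for name, r in rates.items() if r >= 5.0}
```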

Figure 11: Contamination percentages of datasets from PromptSource (Bach et al., 2022).

Results indicate that a portion of the datasets in PromptSource appear in Dolma. Six datasets are completely contaminated (100%): the Winograd Schema Challenge (Levesque et al., 2012), SICK (Marelli et al., 2014), AX from GLUE (Wang et al., 2018), SemEval (specifically, Task 1 from 2014), COPA from SuperGLUE (Roemmele et al., 2011), and AXb (the diagnostic task) from SuperGLUE (Wang et al., 2019). In addition, other datasets are mostly contaminated, with over 90% of their test sets appearing in Dolma documents: OpenAI HumanEval (Chen et al., 2021a), WIC from SuperGLUE (Pilehvar and Camacho-Collados, 2019), e-SNLI (Camburu et al., 2018), and SNLI (Bowman et al., 2015). We note that the contaminated datasets have been excluded from the downstream tasks we use for model evaluation (cf. Appendix D).

4.2 Strategies for Subsets Mixing and Upsampling with Dolma

Like the pretraining corpora of nearly every large-scale language model, Dolma is a multi-source dataset. Training on Dolma thus requires a mixing strategy that determines how much data from each source to include, and potentially which sources to upsample. Like other multi-source corpora (e.g., ROOTS (Laurençon et al., 2023), the Pile (Gao et al., 2020), RedPajama v1 (Together Computer, 2023c)),31 Dolma does not prescribe a single mixing strategy. We refer the reader to Rae et al. (2021) for an example of how one might programmatically search over mixing configurations to maximize performance. Here, we perform mixing experiments as an opportunity to answer some research questions about how different data sources interact. We use the same ablation setup described in §3.
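As a toy illustration of what a mixing strategy amounts to in practice, the sketch below draws documents from named sources according to target weights; the source names and weights are illustrative only and do not correspond to the mixtures evaluated later in Table 5.

```python
# Toy sketch of applying source weights when assembling a multi-source training set.
import random


def sample_mixture(source_docs: dict[str, list[str]],
                   weights: dict[str, float],
                   n_docs: int,
                   seed: int = 0) -> list[str]:
    """Draw documents so each source appears roughly in proportion to its weight."""
    rng = random.Random(seed)
    names = list(weights)
    total = sum(weights.values())
    probs = [weights[name] / total for name in names]
    mixture = []
    for _ in range(n_docs):
        src = rng.choices(names, weights=probs, k=1)[0]
        # Sampling with replacement means heavily weighted sources get upsampled.
        mixture.append(rng.choice(source_docs[src]))
    return mixture


# Hypothetical "reference-heavy" mix:
# mix = sample_mixture(docs_by_source,
#                      {"web": 0.60, "code": 0.15, "papers": 0.15, "wiki": 0.10},
#                      n_docs=1_000_000)
```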

How much code is important for pretraining? It is common practice for language models to be pretrained on some amount of code, even if code generation is not the intended task. Some research has suggested that mixing code into otherwise plain-text pretraining data improves performance on reasoning tasks (Madaan et al., 2022). We investigate whether this observation holds for models trained on Dolma and, if so, how much code is needed.

Table 4: Performance of three models pre-trained with increasing amounts of code on three datasets, across 5 random seeds. We measure exact match for bAbI and GSM8K, and Rouge-2 for WebNLG.

We create three mixtures from the C4 and Stack subsets containing 0%, 5%, and 15% code data. On each, we train a 1B model. We evaluate these models on three reasoning tasks: bAbI (Weston et al., 2015), WebNLG (Gardent et al., 2017), and GSM8K (Cobbe et al., 2021). For the first two tasks, we follow the experimental setup of Muennighoff et al. (2023b) and evaluate each model in an ICL setup with a varying number of demonstrations (0-5) across 5 random seeds. Muennighoff et al. (2023b) show that adding code to pre-training data improves ICL performance on bAbI and WebNLG, and they suggest that code improves long-range state-tracking capabilities. Our experiments, shown in Table 4, corroborate these findings: while the C4-only model fails on all bAbI tasks, adding code improves performance, with a similar trend for WebNLG.
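The ICL evaluation loop implied here can be sketched as follows; `train_set`, `test_item`, and the exact prompt format are hypothetical stand-ins rather than the precise setup of Muennighoff et al. (2023b).

```python
# Sketch of k-shot prompt construction for the ICL evaluation (k demonstrations,
# k in 0..5, repeated over 5 random seeds).
import random


def build_icl_prompt(train_set: list[dict], test_item: dict, k: int, seed: int) -> str:
    """Prepend k randomly sampled demonstrations to the test question."""
    rng = random.Random(seed)
    demos = rng.sample(train_set, k) if k > 0 else []
    demo_text = "".join(f"{d['question']}\n{d['answer']}\n\n" for d in demos)
    return demo_text + test_item["question"] + "\n"


# for k in range(6):
#     for seed in range(5):
#         prompts = [build_icl_prompt(train_set, x, k, seed) for x in test_set]
#         # score model generations against gold answers (exact match / Rouge-2)
```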

On the more difficult GSM8K benchmark, all models fail to produce any correct answer in an ICL setup, even when fine-tuned on the entire training set. However, we find that by fine-tuning on program-aided outputs, where questions are solved by writing Python snippets as described in Gao et al. (2022), the code models outperform the C4-only model. These results show that models pre-trained on code can leverage code generation to answer challenging reasoning tasks even when the original task does not directly involve code.
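For intuition, a program-aided (PAL-style) target in the spirit of Gao et al. (2022) looks like the snippet below; the question and numbers are invented for illustration, and the actual fine-tuning data follows the format described in that work.

```python
# Illustrative program-aided answer: the model is trained to emit a small Python
# program whose execution produces the final numeric answer, instead of free text.
def solve():
    # Made-up question: "A bakery sells 24 muffins a day at $3 each.
    # How much does it earn in 7 days?"
    muffins_per_day = 24
    price_per_muffin = 3
    days = 7
    return muffins_per_day * price_per_muffin * days


print(solve())  # executing the generated program yields 504
```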

Evaluating mixing strategies for pretraining on Dolma While Dolma does not prescribe a specific source mixture, we analyze some commonly used strategies32 and compare their effect using the Paloma evaluation suite (Magnusson et al., 2023). Specifically, we present and evaluate four possible data mixtures in Table 5.

We show results for these mixtures in Figure 12. Overall, we observe that the different mixtures affect the ability of the resulting models to capture specific subdomains. All mixtures show similar perplexity on pages sampled from 100 domains from C4 (Figure 12, left), indicating their general effectiveness at modeling web documents. On the other hand, models struggle to model specialized domains unless they are exposed to them. For example, a model trained on the Web-only mix struggles to represent data in the code domain (Figure 12, center, HumanEval). Finally, we use results on the S2ORC subset of M2D2, which consists of academic papers, to illustrate how different data mixtures affect perplexity. As is the case with code, the Web-only model exhibits higher perplexity due to domain mismatch. On the other hand, models trained on the Reference+ and Gopher-like mixes achieve lower perplexity than the model trained on the Naïve mix, due to more in-domain content. However, we note that, despite significant differences in the amount of academic papers between Reference+ and Gopher-like (4.9% vs. 24.2%), they achieve nearly identical results, suggesting that even a relatively small percentage of in-domain data is sufficient to achieve good domain fit.

31 RedPajama v1 was a reproduction of the multi-source corpus used in LLaMA (Touvron et al., 2023a). RedPajama v2 (Together Computer, 2023a) focuses solely on Common Crawl and is thus single-source. 32 We did not include any social data in these mixes as it was not ready at the time of this experiment.

Table 5: Overview of the mixtures and their composition.

Figure 12: 1B model ablations for different proportions of Dolma data. All mixtures perform similarly on web data (left), while excluding code increases perplexity on code datasets (center). Finally, increasing reference material by upsampling papers and Wikipedia yields lower perplexity on S2ORC (right). Overall, source distribution is linked to downstream capabilities; thus, Dolma users should sample subsets according to their needs.

4.3 Evaluating Olmo-1b

In Table 6 we compare Olmo-1b with other 1B models. Note that while parameter count is matched here, only TinyLlama has been trained for a comparable number of tokens; Pythia 1B is trained on nearly 10 times fewer tokens, and the data composition of StableLM2 is unknown. Nevertheless, we find that Olmo-1b performs better on average than the most comparable model, TinyLlama, outperforming it on 4 out of 8 tasks. Though zero-shot evaluation of downstream tasks is often challenging for relatively small 1B models, the performance of all models on all tasks is above naive random performance. Further details about the downstream tasks are included in Appendix D.

In Figure 13 we assess how the Dolma mix used to train Olmo-1b compares to other popular pretraining corpora in terms of perplexity, using models in which all variables other than the pretraining data are controlled. In particular, we fix the number of tokens each model is trained on to 150B, so that data scale and differences in learning rate schedule do not confound the effect of data composition that we intend to study. This analysis uses the 1B baselines from Paloma and evaluates Paloma’s highest-level metric, which computes perplexity over the combination of test sets from 11 data sources. More fine-grained perplexity results comparing these baselines are available in Magnusson et al. (2023). The present analysis excludes sources that are not publicly available, involve fringe or toxic text, or consist of code data not supported by the benchmark decontamination approach we use. This leaves C4 (Raffel et al., 2020), mC4-en (Chung et al., 2023), WikiText-103 (Merity et al., 2016), Penn Treebank (Marcus et al., 1999; Nunes, 2020), RedPajama (Together Computer, 2023c), Falcon-RefinedWeb (Penedo et al., 2023), Dolma (this work), M2D2 S2ORC (Reid et al., 2022), M2D2 Wikipedia (Reid et al., 2022), C4 100 Domains (Chronopoulou et al., 2022), and Dolma 100 Subreddits (this work).

Table 6: Comparison of Olmo-1b against other similarly sized language models. Olmo-1b was trained on 3 trillion tokens from a preliminary version of Dolma (v. 1.5). Overall, Olmo-1b shows better performance than TinyLlama, which has been trained on a similar number of tokens. Olmo-1b also outperforms Pythia 1B, but the latter has been trained on an order of magnitude fewer tokens. StableLM2 is included in this table as a reference, but it cannot be fairly compared with the other models since the composition of its training data is not known.

Figure 13: Perplexity over all the standard language modeling and fine-grained domain sources in the final, released version of Paloma (Magnusson et al., 2023), excluding code data not supported for decontamination. The models are 1B baselines from Paloma, each trained on 150B tokens of a given corpus. Since Paloma takes stratified samples of hundreds of fine-grained domains, it emphasizes fit to heterogeneous, curated sources more than evaluations on monolithic Common Crawl data like C4 do. The Pile includes the least Common Crawl data, but mostly exhausts the small curated data sources it draws on. Dolma and, to a lesser extent, RedPajama demonstrate the possibility of maintaining this sample efficiency on fit to diverse domains while including large-scale Common Crawl data.

Our controlled perplexity analysis reveals the importance of including non-Common Crawl data from diverse curated sources. The metric we use from Paloma surfaces how well models fit more heterogeneous data, because it samples marked domains from each source equally rather than in proportion to their unequal sizes within the source. Intuitively, the baseline trained on the Pile is well fit to such data, as that pretraining corpus is mostly drawn from just such smaller, hand-picked sources. But as we wish to scale the total number of tokens in a corpus, the challenge becomes how to integrate more of the available Common Crawl data without losing sample efficiency on diverse evaluations such as this Paloma metric. In this case we see that the Dolma baseline nearly matches the performance curve of the Pile baseline even though the fraction of Common Crawl data it includes is more than 4 times greater.
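To make the aggregation intuition concrete, the sketch below computes a macro-averaged perplexity in which every domain contributes equally; this is a simplified stand-in for Paloma’s actual metric, whose exact definition is given in Magnusson et al. (2023), and the per-domain loss inputs are hypothetical.

```python
# Sketch of a macro-averaged perplexity: average per-domain log-perplexities so
# that small curated domains weigh as much as large Common Crawl-derived ones.
import math


def macro_perplexity(per_domain_losses: dict[str, tuple[float, int]]) -> float:
    """per_domain_losses maps domain -> (total negative log-likelihood, token count)."""
    log_ppls = [nll / max(tokens, 1) for nll, tokens in per_domain_losses.values()]
    return math.exp(sum(log_ppls) / len(log_ppls))


# e.g. macro_perplexity({"ptb": (52_000.0, 10_000), "c4_news": (38_000.0, 10_000)})
```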

5 Releasing Dolma

Risk mitigation We recognize that any dataset derived from large web crawls will contain factually incorrect information, toxic language, hate speech, PII, and other types of harmful content. While we have made an effort to curate this dataset with these risks in mind, we believe risk mitigation is best approached from multiple directions, including careful consideration of licenses and access controls.

Copyright While most datasets we used were curated with copyright and licensing in mind (e.g., open access papers in peS2o (Soldaini and Lo, 2023), open source repositories in the Stack (Kocetkov et al., 2022)) or were already permissively licensed (e.g., Wikipedia is released under a Creative Commons license), we recognize that large web crawls will also contain copyrighted material. Yet, given current tools, it is not possible to reliably or scalably detect copyrighted material in a corpus of this size. Our decision to release Dolma publicly factors in several considerations, including that all our data sources are publicly available and already being used in large-scale language model pretraining (both open and closed); we refer the reader to our public position on AI and fair use (Farhadi et al., 2023).

We recognize that the legal and ethical landscape of AI is changing rapidly, and we plan to revisit our choices as new information becomes available.
