
Decontamination | Pile

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-01-28

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

  • url: https://arxiv.org/abs/2101.00027
  • pdf: https://arxiv.org/pdf/2101.00027
  • abstract: Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets – both existing and newly constructed – many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
  • related paper: The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (https://arxiv.org/abs/2306.01116)

[Decontamination: key index markers]


Contents

TL;DR


| Component | Raw Size | Weight | Epochs | Effective Size | Mean Document Size |
| --- | --- | --- | --- | --- | --- |
| Pile-CC | 227.12 GiB | 18.11% | 1.0 | 227.12 GiB | 4.33 KiB |
| PubMed Central | 90.27 GiB | 14.40% | 2.0 | 180.55 GiB | 30.55 KiB |
| Books3† | 100.96 GiB | 12.07% | 1.5 | 151.44 GiB | 538.36 KiB |
| OpenWebText2 | 62.77 GiB | 10.01% | 2.0 | 125.54 GiB | 3.85 KiB |
| ArXiv | 56.21 GiB | 8.96% | 2.0 | 112.42 GiB | 46.61 KiB |
| Github | 95.16 GiB | 7.59% | 1.0 | 95.16 GiB | 5.25 KiB |
| FreeLaw | 51.15 GiB | 6.12% | 1.5 | 76.73 GiB | 15.06 KiB |
| StackExchange | 32.20 GiB | 5.13% | 2.0 | 64.39 GiB | 2.16 KiB |
| USPTO Backgrounds | 22.90 GiB | 3.65% | 2.0 | 45.81 GiB | 4.08 KiB |
| PubMed Abstracts | 19.26 GiB | 3.07% | 2.0 | 38.53 GiB | 1.30 KiB |
| Gutenberg (PG-19)† | 10.88 GiB | 2.17% | 2.5 | 27.19 GiB | 398.73 KiB |
| OpenSubtitles† | 12.98 GiB | 1.55% | 1.5 | 19.47 GiB | 30.48 KiB |
| Wikipedia (en)† | 6.38 GiB | 1.53% | 3.0 | 19.13 GiB | 1.11 KiB |
| DM Mathematics† | 7.75 GiB | 1.24% | 2.0 | 15.49 GiB | 8.00 KiB |
| Ubuntu IRC | 5.52 GiB | 0.88% | 2.0 | 11.03 GiB | 545.48 KiB |
| BookCorpus2 | 6.30 GiB | 0.75% | 1.5 | 9.45 GiB | 369.87 KiB |
| EuroParl† | 4.59 GiB | 0.73% | 2.0 | 9.17 GiB | 68.87 KiB |
| HackerNews | 3.90 GiB | 0.62% | 2.0 | 7.80 GiB | 4.92 KiB |
| YouTube Subtitles | 3.73 GiB | 0.60% | 2.0 | 7.47 GiB | 22.55 KiB |
| PhilPapers | 2.38 GiB | 0.38% | 2.0 | 4.76 GiB | 73.37 KiB |
| NIH ExPorter | 1.89 GiB | 0.30% | 2.0 | 3.79 GiB | 2.11 KiB |
| Enron Emails† | 0.88 GiB | 0.14% | 2.0 | 1.76 GiB | 1.78 KiB |
| The Pile | 825.18 GiB | | | 1254.20 GiB | 5.91 KiB |

Introduction

Recent breakthroughs in general-purpose language modeling have demonstrated the effectiveness of training large models on massive text corpora for downstream applications. As language model training continues to scale, the demand for high-quality, large-scale text data will keep growing.

This growing demand for data has led most existing large-scale language models to rely on Common Crawl for most or all of their data. While training on Common Crawl has been effective, recent work has shown that dataset diversity improves downstream generalization. It has also been shown that large language models can effectively acquire knowledge of a new domain from relatively small amounts of in-domain training data. These results suggest that mixing a large number of smaller, high-quality, diverse datasets can improve a model's general cross-domain knowledge and downstream generalization compared to models trained on only a handful of data sources.

To address this need, the paper introduces the Pile, an 825.18 GiB English text dataset designed for training large-scale language models. The Pile is composed of 22 diverse, high-quality datasets, including both established NLP datasets and several newly introduced ones. Beyond training large language models, the Pile is also useful for broad benchmarking of a language model's cross-domain knowledge and generalization ability.

The new datasets are derived from sources such as PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US Patent and Trademark Office, PubMed, Ubuntu IRC, HackerNews, YouTube, PhilPapers, and NIH ExPorter. The paper also introduces OpenWebText2 and BookCorpus2, extensions of the original OpenWebText and BookCorpus datasets.

In addition, several existing high-quality datasets are incorporated: Books3, Project Gutenberg (PG-19), OpenSubtitles, English Wikipedia, DM Mathematics, EuroParl, and the Enron Emails corpus, along with Pile-CC, a newly introduced Common Crawl subset with improved extraction quality.

The main contributions of the paper are:

  1. The introduction of an 825.18 GiB English language modeling dataset combining 22 diverse sources.
  2. The introduction of 14 new language modeling datasets expected to be of independent research interest.
  3. Evaluations demonstrating the substantial improvements shown by GPT-2-sized models trained on this new dataset, compared to training on CC-100 and raw Common Crawl data.
  4. The investigation and documentation of the dataset, so that researchers are better informed about how to use it and are motivated to conduct similar investigations of their own data.

1 Introduction

Recent breakthroughs in general-purpose language modeling have demonstrated the effectiveness of training massive models on large text corpora for downstream applications (Radford et al., 2019; Shoeybi et al., 2019; Raffel et al., 2019; Rosset, 2019; Brown et al., 2020; Lepikhin et al., 2020). As the field continues to scale up language model training, the demand for high-quality massive text data will continue to grow (Kaplan et al., 2020).

The growing need for data in language modeling has caused most existing large-scale language models to turn to the Common Crawl for most or all of their data (Brown et al., 2020; Raffel et al., 2019). While training on the Common Crawl has been effective, recent work has shown that dataset diversity leads to better downstream generalization capability (Rosset, 2019). Additionally, large-scale language models have been shown to effectively acquire knowledge in a novel domain with only relatively small amounts of training data from that domain (Rosset, 2019; Brown et al., 2020; Carlini et al., 2020). These results suggest that by mixing together a large number of smaller, high quality, diverse datasets, we can improve the general cross-domain knowledge and downstream generalization capabilities of the model compared to models trained on only a handful of data sources.

1 https://pile.eleuther.ai/

To address this need, we introduce the Pile: an 825.18 GiB English text dataset designed for training large-scale language models. The Pile is composed of 22 diverse and high-quality datasets, including both established natural language processing datasets and several newly introduced ones. In addition to its utility in training large language models, the Pile can also serve as a broad-coverage benchmark for cross-domain knowledge and generalization ability of language models.

We introduce new datasets derived from the following sources: PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US Patent and Trademark Office, PubMed, Ubuntu IRC, HackerNews, YouTube, PhilPapers, and NIH ExPorter. We also introduce OpenWebText2 and BookCorpus2, which are extensions of the original OpenWebText (Gokaslan and Cohen, 2019) and BookCorpus (Zhu et al., 2015; Kobayashi, 2018) datasets, respectively.

In addition, we incorporate several existing high-quality datasets: Books3 (Presser, 2020), Project Gutenberg (PG-19) (Rae et al., 2019), OpenSubtitles (Tiedemann, 2016), English Wikipedia, DM Mathematics (Saxton et al., 2019), EuroParl (Koehn, 2005), and the Enron Emails corpus (Klimt and Yang, 2004). To supplement these, we also introduce a new filtered subset of Common Crawl, Pile-CC, with improved extraction quality.

Figure 1: Treemap of Pile components by effective size.

1.1 Contributions

The core contributions of this paper are:

  1. The introduction of an 825.18 GiB English-language dataset for language modeling combining 22 diverse sources.
  2. The introduction of 14 new language modeling datasets, which we expect to be of independent interest to researchers.
  3. Evaluations demonstrating significant improvements across many domains by GPT-2-sized models trained on this new dataset, compared to training on CC-100 and raw Common Crawl.
  4. The investigation and documentation of this dataset, which we hope will better inform researchers about how to use it as well as motivate them to undertake similar investigations of their own data.

Through our analyses, we confirm that the Pile is significantly distinct from pure Common Crawl data. Additionally, our evaluations show that the existing GPT-2 and GPT-3 models perform poorly on many components of the Pile, and that models trained on the Pile significantly outperform both raw and filtered Common Crawl models. To complement the performance evaluations, we also perform an exploratory analysis of the text within the Pile to provide a detailed picture of the data. We hope that our extensive documentation of the construction and characteristics of the Pile will help researchers make informed decisions about potential downstream applications.

Finally, we make publicly available the preprocessing code for the constituent datasets of the Pile and the code for constructing alternative versions2. In the interest of reproducibility, we also document all processing performed on each dataset (and the Pile as a whole) in as much detail as possible. For further details about the processing of each dataset, see Section 2 and Appendix C.

2 https://github.com/EleutherAI/

2 The Pile Datasets

The Pile is composed of 22 constituent sub-datasets, as shown in Table 1. Following Brown et al. (2020), we increase the weights of higher quality components, with certain high-quality datasets such as Wikipedia being seen up to 3 times (“epochs”) for each full epoch over the Pile. Detailed information about the construction of each dataset is available in Appendix C.

Table 1: Overview of datasets in the Pile before creating the held out sets. Raw Size is the size before any up- or down-sampling. Weight is the percentage of bytes in the final dataset occupied by each dataset. Epochs is the number of passes over each constituent dataset during a full epoch over the Pile. Effective Size is the approximate number of bytes in the Pile occupied by each dataset. Datasets marked with a † are used with minimal preprocessing from prior work.
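As a concrete reading of Table 1, Effective Size is Raw Size multiplied by Epochs, and Weight is each component's share of the total effective size. The following is a minimal sketch of that arithmetic (values copied from the table; the actual construction code lives in the EleutherAI repository, and small mismatches against the table are rounding in the published Raw Size column):

```python
# Minimal sketch: how the Effective Size and Weight columns of Table 1 relate.
# Sizes are GiB as reported; only a few rows are shown for brevity.
components = {
    "Pile-CC":        {"raw_gib": 227.12, "epochs": 1.0},
    "PubMed Central": {"raw_gib": 90.27,  "epochs": 2.0},
    "Wikipedia (en)": {"raw_gib": 6.38,   "epochs": 3.0},
    # ... remaining 19 components from Table 1
}

TOTAL_EFFECTIVE_GIB = 1254.20  # Effective Size of the full Pile, from Table 1

for name, c in components.items():
    effective = c["raw_gib"] * c["epochs"]
    weight = 100 * effective / TOTAL_EFFECTIVE_GIB
    print(f"{name}: effective {effective:.2f} GiB, weight {weight:.2f}%")
# Pile-CC: effective 227.12 GiB, weight 18.11%
# PubMed Central: effective 180.54 GiB, weight 14.40%
# Wikipedia (en): effective 19.14 GiB, weight 1.53%
```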

2.1 Pile-CC

Common Crawl is a collection of website crawls from 2008 onwards, including raw web pages, metadata and text extractions. Due to the raw nature of the dataset, Common Crawl has the advantage of including text from diverse domains, but at the cost of varying quality data. Due to this, use of Common Crawl typically necessitates well-designed extraction and filtering. Our Common Crawl-based dataset, Pile-CC, uses jusText (Endrédy and Novák, 2013) on Web Archive files (raw HTTP responses including page HTML) for extraction, which yields higher quality output than directly using the WET files (extracted plain text).
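As a rough illustration of this extraction step, the sketch below pulls text out of WARC records with jusText and drops boilerplate paragraphs. It is not the Pile's actual pipeline (which also applies a quality classifier, see Appendix C.1), and the use of warcio for reading WARC files is an assumption made for the example:

```python
# Sketch: extract main text from WARC records with jusText, dropping boilerplate.
# Assumes `pip install warcio justext`; not the exact Pile-CC pipeline.
import justext
from warcio.archiveiterator import ArchiveIterator

def extract_texts(warc_path):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request/metadata records
            html = record.content_stream().read()
            try:
                paragraphs = justext.justext(html, justext.get_stoplist("English"))
            except Exception:
                continue  # malformed HTML, undecodable bytes, etc.
            text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
            if text:
                yield text
```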

2.2 PubMed Central

PubMed Central (PMC) is a subset of the PubMed online repository for biomedical articles run by the United States of America’s National Center for Biotechnology Information (NCBI), providing open, full-text access to nearly five million publications. Most publications indexed by PMC are recent, and their inclusion is mandated for all NIH funded research starting from 2008 by the NIH Public Access Policy. We included PMC in the hopes that it will benefit potential downstream applications to the medical domain.

2.3 Books3

Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020). Bibliotik consists of a mix of fiction and nonfiction books and is almost an order of magnitude larger than our next largest book dataset (BookCorpus2). We included Bibliotik because books are invaluable for long-range context modeling research and coherent storytelling.

2.4 OpenWebText2

OpenWebText2 (OWT2) is a generalized web scrape dataset inspired by WebText (Radford et al., 2019) and OpenWebTextCorpus (Gokaslan and Cohen, 2019). Similar to the original WebText, we use net upvotes on Reddit submissions as a proxy for outgoing link quality. OpenWebText2 includes more recent content from Reddit submissions up until 2020, content from multiple languages, document metadata, multiple dataset versions, and open source replication code. We included OWT2 as a high quality general purpose dataset.
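The link-quality filter can be pictured as a simple threshold on net submission score. Below is a hedged sketch: the threshold of 3 mirrors the original WebText heuristic and is an assumption here, not a documented OWT2 parameter.

```python
# Sketch: keep outgoing links whose Reddit submissions have enough net upvotes.
# MIN_SCORE = 3 follows the WebText convention; OWT2's exact threshold may differ.
MIN_SCORE = 3

def filter_links(submissions):
    """submissions: iterable of (url, net_score) pairs aggregated per outgoing link."""
    return [url for url, score in submissions if score >= MIN_SCORE]

links = filter_links([("https://example.com/a", 12), ("https://example.com/b", 1)])
print(links)  # ['https://example.com/a']
```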

2.5 ArXiv

ArXiv is a preprint server for research papers that has operated since 1991. As shown in fig. 10, arXiv papers are predominantly in the fields of Math, Computer Science, and Physics. We included arXiv in the hopes that it will be a source of high quality text and math knowledge, and benefit potential downstream applications to research in these areas. ArXiv papers are written in LaTeX, a common typesetting language for mathematics, computer science, physics, and some adjacent fields. Training a language model to be able to generate papers written in LaTeX could be a huge boon to the research community.

2.6 GitHub

GitHub is a large corpus of open-source code repositories. Motivated by the ability of GPT-3 (Brown et al., 2020) to generate plausible code completions despite its training data not containing any explicitly gathered code datasets, we included GitHub in the hopes that it would enable better downstream performance on code-related tasks.

2.7 FreeLaw

The Free Law Project is a US-registered non-profit that provides access to and analytical tools for academic studies in the legal realm. CourtListener,3 part of the Free Law Project, provides bulk downloads for millions of legal opinions from federal and state courts. While the full dataset provides multiple modalities of legal proceedings, including dockets, bibliographic information on judges, and other metadata, we focused specifically on court opinions due to an abundance of full-text entries. This data is entirely within the public domain.

2.8 Stack Exchange

The Stack Exchange Data Dump4 contains an anonymized set of all user-contributed content on the Stack Exchange network, a popular collection of websites centered around user-contributed questions and answers. It is one of the largest publicly available repositories of question-answer pairs, and covers a wide range of subjects—from programming, to gardening, to Buddhism. We included Stack Exchange in the hopes that it will improve the question answering capabilities of downstream models on diverse domains.

2.9 USPTO Backgrounds

USPTO Backgrounds is a dataset of background sections from patents granted by the United States Patent and Trademark Office, derived from its published bulk archives5. A typical patent background lays out the general context of the invention, gives an overview of the technical field, and sets up the framing of the problem space. We included USPTO Backgrounds because it contains a large volume of technical writing on applied subjects, aimed at a non-technical audience.

2.10 Wikipedia (English)

Wikipedia is a standard source of high-quality text for language modeling. In addition to being a source of high quality, clean English text, it is also valuable as it is written in expository prose, and spans many domains.

2.11 PubMed Abstracts

PubMed Abstracts consists of the abstracts from 30 million publications in PubMed, the online repository for biomedical articles run by the National Library of Medicine. While the PMC (see Section 2.2) provides full-text access, the subset of coverage is significantly limited and biased towards recent publications. PubMed also incorporates MEDLINE, which expands the coverage of biomedical abstracts from 1946 to present day.

3 https://www.courtlistener.com/

4 https://archive.org/details/stackexchange

5 https://bulkdata.uspto.gov/

2.12 Project Gutenberg

Project Gutenberg is a dataset of classic Western literature. The specific Project Gutenberg derived dataset we used, PG-19, consists of Project Gutenberg books from before 1919 (Rae et al., 2019), which represent distinct styles from the more modern Books3 and BookCorpus. Additionally, the PG-19 dataset is already being used for long-distance context modeling.

2.13 OpenSubtitles

The OpenSubtitles dataset is an English language dataset of subtitles from movies and television shows gathered by Tiedemann (2016). Subtitles provide an important source of natural dialog, as well as an understanding of fictional formats other than prose, which may prove useful for creative writing generation tasks such as screenwriting, speechwriting, and interactive storytelling.

2.14 DeepMind Mathematics

The DeepMind Mathematics dataset consists of a collection of mathematical problems from topics such as algebra, arithmetic, calculus, number theory, and probability, formatted as natural language prompts (Saxton et al., 2019). One major weakness of large language models has been performance on mathematical tasks (Brown et al., 2020), which may be due in part to a lack of math problems in the training set. By explicitly including a dataset of mathematical problems, we hope to improve the mathematical ability of language models trained on the Pile.

2.15 BookCorpus2

BookCorpus2 is an expanded version of the original BookCorpus (Zhu et al., 2015), a widely used language modeling corpus consisting of books written by “as of yet unpublished authors.” BookCorpus is therefore unlikely to have significant overlap with Project Gutenberg and Books3, which consist of published books. BookCorpus is also commonly used as a dataset for training language models (Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019).

2.16 Ubuntu IRC

The Ubuntu IRC dataset is derived from the publicly available chatlogs6 of all Ubuntu-related channels on the Freenode IRC chat server. Chatlog data provides an opportunity to model real-time human interactions, which feature a level of spontaneity not typically found in other modes of social media.

6 https://irclogs.ubuntu.com/

2.17 EuroParl

EuroParl (Koehn, 2005) is a multilingual parallel corpus originally introduced for machine translation but which has also seen use in several other fields of NLP (Groves and Way, 2006; Van Halteren, 2008; Ciobanu et al., 2017). We use the most current version at time of writing, which consists of the proceedings of the European Parliament in 21 European languages from 1996 until 2012.

2.18 YouTube Subtitles

The YouTube Subtitles dataset is a parallel corpus of text gathered from human generated closed captions on YouTube. In addition to providing multilingual data, YouTube Subtitles is also a source of educational content, popular culture, and natural dialog.

2.19 PhilPapers

The PhilPapers7 dataset consists of open-access philosophy publications from an international database maintained by the Center for Digital Philosophy at the University of Western Ontario. We included PhilPapers because it spans a wide body of abstract, conceptual discourse, and its articles contain high quality academic writing.

2.20 NIH Grant Abstracts: ExPORTER

The NIH Grant Abstracts dataset provides a bulk-data repository for awarded applications through the ExPORTER8 service, covering fiscal years 1985 to the present. We included the dataset because it contains examples of high-quality scientific writing.

2.21 Hacker News

Hacker News9 is a link aggregator operated by Y Combinator, a startup incubator and investment fund. Users submit articles defined as “anything that gratifies one’s intellectual curiosity,” but submitted articles tend to focus on topics in computer science and entrepreneurship. Users can comment on submitted stories, resulting in comment trees discussing and critiquing submitted stories. We scrape, parse, and include these comment trees since we believe they provide high quality dialogue and debate on niche topics.

7 https://philpapers.org/

8 https://exporter.nih.gov/

9 https://news.ycombinator.com

2.22 Enron Emails

The Enron Emails dataset (Klimt and Yang, 2004) is a valuable corpus commonly used for research about the usage patterns of email. We included Enron Emails to aid in understanding the modality of email communications, which is typically not found in any of our other datasets.

3 Benchmarking Language Models with the Pile

While the Pile was conceived as a training dataset for large-scale language models, its coverage of multiple disparate domains makes it also suitable as an evaluation dataset. In this section, we describe how the Pile can be used as a broad-coverage dataset for benchmarking language models.

3.1 Benchmarking Guidelines

The Pile is provided as train, validation, and testing splits. The validation and testing components each contain 0.1% of the data, sampled uniformly at random. While this is a far smaller percentage than most datasets, the sheer size of the dataset results in over 1 GiB of validation and testing data each. We highlight that while we have made efforts to deduplicate documents within the Pile (See: Section D.2), it is still possible that some documents are duplicated across the train/validation/test splits.

Our preferred metric is bits per UTF-8 encoded byte (BPB). Bits per byte is preferred over bits per character or perplexity when using the Pile as a metric due to its invariance to different tokenization schemes and the ambiguity of measuring characters in Unicode. To compute bits per byte from a given negative log likelihood loss ℓ, we compute BPB = (L_T / L_B) · log2(e^ℓ) = (L_T / L_B) · ℓ / ln(2), where L_T is the length of the dataset in tokens and L_B is the length of the dataset in UTF-8 encoded bytes. We find that L_T / L_B is 0.29335 GPT-2 tokens per byte across the Pile; dataset-specific values of L_T / L_B can be found in Table 7.
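The conversion is easy to get wrong in practice, so a small helper is shown below. It assumes the loss ℓ is the mean per-token negative log likelihood in nats, as in the formula above.

```python
import math

def bits_per_byte(nll_nats_per_token, n_tokens, n_bytes):
    """Convert a mean per-token NLL (nats) to bits per UTF-8 byte.

    BPB = (L_T / L_B) * loss / ln(2), with L_T tokens and L_B bytes.
    """
    return (n_tokens / n_bytes) * nll_nats_per_token / math.log(2)

# Example with the Pile-wide ratio of 0.29335 GPT-2 tokens per byte:
# a loss of 2.0 nats/token corresponds to roughly 0.85 bits per byte.
print(bits_per_byte(2.0, n_tokens=29335, n_bytes=100000))
```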

3.2 Test Perplexity with GPT-2 and GPT-3

We compute the test perplexity of the constituent datasets of the Pile using GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020), shown in Figure 2. We use all available versions of GPT-2, and all four versions of GPT-3 available via the OpenAI API. Because of the cost associated with using the OpenAI API, we evaluate on one-tenth of the respective test sets for most of the constituent datasets. We report the perplexity converted to bits per UTF-8 encoded byte (BPB). Importantly, we compute perplexity by evaluating each document independently within each dataset, as opposed to concatenating all documents as is common practice for computing perplexity on large corpora.

Full details of the perplexity computation can be found in Appendix E.2.

Unsurprisingly, larger language models generally attain lower perplexity compared to smaller models. Recent work has shown an increased focus on the empirical scaling laws of language models (Kaplan et al., 2020; Henighan et al., 2020). As such, we investigate the scaling law for the GPT-2 and GPT-3 families of models on perplexity evaluation on the Pile. The scaling law relation for the GPT-3 family of models is shown in Figure 2.10 The line of best fit shown in the figure has a coefficient of -0.1674 and an intercept of 2.5516.

Figure 2: Scaling law for performance of GPT-2/3 models. ‘Zero-shot’ refers to the fact that none of the models have been fine-tuned on data from the Pile.

Interestingly, while GPT-2 and GPT-3 were not trained on the Pile, there still appears to be a clear scaling law without diminishing returns. We hypothesize that this is due to the inherent generalization capability of these models. We leave a more rigorous analysis of zero-shot scaling laws to future work.

10 While the sizes of GPT-3 models on the OpenAI API have not been publicized, we assume here that ada, babbage, curie and davinci models correspond to 2.7B, 6.7B, 13B and 175B parameter models respectively.

3.3 Relative Componentwise GPT-3 Pile Performance

Determining which components GPT-3 underperforms on provides information about which Pile components are most dissimilar to the distribution of text (web pages and books) that GPT-3 was trained on. These components would thus make especially good candidates for supplementing GPT-3 training data. These results are also valuable for determining which types of datasets to emphasize for future iterations of the Pile.

Due to the difference in entropy of different datasets, directly comparing perplexity of GPT-3 on different Pile components is not an accurate indication of relative performance. Ideally we would train a GPT-3 model from scratch on the Pile and compare the difference in loss per dataset with that of the original GPT-3. Because of resource constraints, we instead use a GPT-2 model trained from scratch on the Pile (see Section 4) to construct a proxy measure. To construct our proxy, we first measure the improvement from the GPT-2-Pile model to GPT-3 on each component. Then, we normalize our results by setting the change on OpenWebText2 to be zero. This computation is shown in the equation below:

∆set = [BPB_GPT-3(set) − BPB_GPT-3(OWT2)] − [BPB_GPT-2-Pile(set) − BPB_GPT-2-Pile(OWT2)]

Since GPT2-Pile was trained on both OWT2 and the dataset we are evaluating, we expect the second term in ∆set to reflect the difference in the intrinsic difficulty of the two datasets. Thus the total value of ∆set reflects how much harder the dataset we are evaluating was for GPT-3 than OWT2, minus the relative difficulty of the two tasks. As GPT-3 was trained on data very similar to OWT2, this gives us a proxy for how much better GPT-3 would do if it were trained on the Pile.
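In code, the proxy is a couple of dictionary lookups. The sketch below assumes per-component BPB numbers for both models are already available; the variable and function names are illustrative, not from the paper's code.

```python
# Sketch: componentwise change from the Pile-trained model to GPT-3,
# normalized so that OpenWebText2 maps to zero (lower = GPT-3 relatively better).
def delta_set(bpb_gpt3, bpb_pile_model, baseline="OpenWebText2"):
    """Both arguments are dicts mapping component name -> bits per byte."""
    norm = bpb_gpt3[baseline] - bpb_pile_model[baseline]
    return {
        name: (bpb_gpt3[name] - bpb_pile_model[name]) - norm
        for name in bpb_gpt3
    }
```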

The results are shown in Figure 3. As a sanity check, we observe that datasets that are contained in, or are extremely similar to, GPT-3’s training set (Books3, Wikipedia (en), Pile-CC and Project Gutenberg) score close to zero on our metric.

GPT-3 appears to perform poorly on datasets pertaining to research or academic writing like PubMed Central, PubMed Abstracts, and ArXiv; domain-specific datasets like FreeLaw, HackerNews, and USPTO Backgrounds; and on datasets containing predominantly text distinct from natural language, like GitHub and DM Mathematics. In addition, the majority of datasets see less of an improvement than OpenWebText2. As such, we expect a GPT-3 sized model trained on Pile to perform significantly better on research related tasks, software tasks, and symbol manipulation tasks than the base model. Additionally, this experiment provides evidence that the majority of Pile components are not redundant with the predominantly web-based GPT-3 training data.

We note that this metric is only a proxy for similarity, and that it could be confounded by dataset specific scaling effects. Although our results largely accord with expectations, there are some puzzling results, like the datasets on which GPT-3 outperformed GPT-2 Pile. We hypothesize that GPT-3 learns to be so good at these datasets that training on them explicitly does not notably benefit the model’s performance. We leave a more rigorous analysis of these effects for future work.

4 Evaluation

To confirm the effectiveness of the Pile for improving language modeling quality, we train architecturally-identical 1.3 billion parameter models based on those in Brown et al. (2020) on different datasets and evaluate on the WikiText and LAMBADA tasks as benchmarks of language modeling ability. We also report results on the Pile as a measure of more cross-domain generalization.

4.1 Methodology

To ensure a fair comparison across datasets of different sizes, we decontaminate any instances of the evaluation sets using the same 13-gram overlap filtering as in Brown et al. (2020) and downsample to 40GB to control for dataset size. As we control for dataset size, we emphasize that our evaluation is generous to CC-100 (en), which is about 1/3 the size of the Pile in reality.
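Because decontamination is the focus of this note, here is a much-simplified sketch of the 13-gram overlap idea: drop any training document that shares a 13-gram with the held-out text. The exact procedure in Brown et al. (2020) is more involved, so treat this purely as an illustration.

```python
# Simplified sketch of 13-gram overlap decontamination (illustrative only).
N = 13

def ngrams(text, n=N):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_docs, eval_docs):
    banned = set()
    for doc in eval_docs:
        banned |= ngrams(doc)               # all 13-grams appearing in eval data
    return [doc for doc in train_docs if not (ngrams(doc) & banned)]
```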

We compare the following datasets: the Pile, the English component of the CC-100 dataset11 (Wenzek et al., 2019; Conneau et al., 2020), and a sample of raw CC WET files filtered for English-only.

Table 2: Test perplexity of the Pile using GPT-2 and GPT-3, converted to bits per UTF-8 encoded byte (BPB). Evaluation is performed on one-tenth of the test data of the Pile, on a per-document basis. Bold indicates the best-performing model in each row.

4.2 Results

On traditional language modeling benchmarks, the Pile improves significantly on WikiText and shows negligible changes in LAMBADA. However, models trained on Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, as shown in Table 4. This indicates that models trained on the Pile have greater cross-domain generalization capabilities without compromising performance on traditional benchmarks.

The magnitude of improvement over CC-100 per set is shown in Figure 4. Unsurprisingly, there is almost no improvement on Pile-CC. However, the model trained on the Pile performs significantly better than either of the other models on academic datasets such as ArXiv, Pubmed Central, FreeLaw, and PhilPapers. It also improves significantly on programming-related datasets like Github and StackExchange, on EuroParl, due to the lack of multilingual text in either other dataset, and on DM Mathematics, indicating a significant improvement in mathematical ability.

11 The data was obtained from http://data.statmt.org/cc-100/.

Surprisingly, raw Common Crawl performs better on the Pile BPB than CC-100, despite losing by a significant margin on LAMBADA and WikiText. We hypothesize that this is due to the perplexity based filtering used in CC-100, where a language model is trained on Wikipedia and all data with a perplexity too high or too low is discarded. This effectively discards any data too similar to or too different from Wikipedia, which severely limits the diversity of the collected data. This result suggests that future work using Common Crawl should take caution with filtering to preserve its diversity.

5 Structural Statistics

In this section, we cover the structural statistics of the dataset, which provide more coarse-grained and statistical information about the Pile. In Section 6, we provide a closer investigation and documentation of the textual content within the Pile datasets.

Figure 3: Change in BPB from GPT-2 trained on Pile to GPT-3 zero-shot, relative to OpenWebText2 BPB change. Dotted line indicates overall Pile change. Lower indicates better relative performance by GPT-3.

Table 3: Size-controlled evaluation results. Each dataset is deduplicated against all evaluation metrics and subsampled to approximately 40GB to control for the effects of dataset size. For LAMBADA, we use the variant of the data introduced in Radford et al. (2019) and only evaluate the perplexity on the final token rather than the final word. For WikiText, we report the perplexity per GPT-2 token. † indicates that the size is an estimate.

5.1 Document Lengths and Tokenization

Each dataset consists of a large number of documents. We analyze the distribution of document lengths, as well as the number of bytes-per-token using the GPT-2 tokenizer in order to put our ablations in context.

While the majority of documents in the Pile are short, there is a long tail of very long documents (Figure 5).

Since the GPT-2 BPE tokenizer is trained on WebText, the mean bytes per token is also a very rough indicator of how syntactically different each Pile component is from WebText. For instance, datasets like NIH ExPorter, OpenWebText2 and Books3 consist largely of ordinary text in a similar distribution to WebText, which is reflected in a greater number of bytes per token. On the other hand, many of the sets with the lowest bytes per token are those which consist in large part of non-text content (Github, ArXiv, Stack Exchange, and DM Mathematics) or languages other than English (EuroParl).
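Mean bytes per GPT-2 token for a document can be reproduced with the Hugging Face tokenizer; a minimal sketch follows (the transformers dependency is an assumption for the example, not necessarily what the authors used):

```python
# Sketch: mean UTF-8 bytes per GPT-2 BPE token for a piece of text.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def bytes_per_token(text):
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(tokenizer.encode(text))
    return n_bytes / n_tokens

print(bytes_per_token("def f(x): return x ** 2"))   # code tends to score lower
print(bytes_per_token("The quick brown fox jumps over the lazy dog."))
```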

5.2 Language and Dialects

While only 13% of the world’s population speaks English, the vast majority of NLP research is done on English. For the Pile, we took a similar approach to the dataset used by Brown et al. (2020) and focused predominantly on English, while also not explicitly filtering out other languages when collecting our own data. When evaluating a multilingual dataset, our main criteria for inclusion was whether the English component of the dataset merited inclusion alone.

Figure 4: Magnitude of BPB improvement of Pile model over CC-100 model on each test set.

Figure 5: Distribution of document lengths in Pile. The highest 1 percentile of document length are considered to be outliers and excluded from this plot.

Using fasttext (Suárez et al., 2019a), we determine that the Pile is 97.4% English. We note that due to issues with language identification, particularly with rare languages (Caswell et al., 2020), this methodology provides only a rough estimate for English content and no reliable conclusions for low-resource languages can be drawn.

Figure 6: Mean bytes per GPT-2-token for each dataset in the Pile. Error bars indicate standard deviation.

6 Investigating and Documenting the Datasets

As the scale of machine learning research has grown, scrutiny has been placed on the ever larger datasets that models are trained on (Prabhu and Birhane, 2020; Biderman and Scheirer, 2020). While this issue has been raised within AI ethics and bias research (Hovy and Spruit, 2016; Hutchinson et al., 2020; Blodgett et al., 2020), it has not been a focal point of concern within the language modeling community. Although frameworks for documenting datasets have been proposed (Gebru et al., 2018; Jo and Gebru, 2020), no dataset intended to train massive language models has been seriously documented by its creators12. Therefore, our analyses serve two goals: to address ethical concerns about the Pile, and to promote and normalize the practice of engaging with the AI ethics literature.

Natural language processing technologies are widely applicable and can be used in extremely different contexts. What is and is not appropriate data to train on can therefore vary wildly with the application context. In our view, the best approach is to document rather than eliminate potentially concerning aspects of datasets13, particularly since the purpose of the Pile is to train general-purpose language models. The primary goal of our documentation, therefore, is to empower NLP researchers to make informed decisions.

6.1 Documenting Methods

To document the Pile, we chose to implement two frameworks that have been proposed by methodologists and ethics researchers. The first, the datasheets methodology (Gebru et al., 2018), is a general purpose methodology that is recommended by several methodologists (Raji and Yang, 2019; Biderman and Scheirer, 2020) and appears to be used more frequently by practitioners than alternatives. The second, the data statements methodology (Bender and Friedman, 2018), was proposed specifically for natural language processing and has been well received by the NLP community. Our datasheet and data statement will be featured in the GitHub repository where the code for the Pile is stored and will also be available as separate documents on arXiv (Biderman et al., 2021; Biderman, 2021).

In addition to the datasheet and data statement, there is additional information that may be helpful to people training language models that these documents do not cover. In the rest of this section we investigate and document in greater detail some of this additional contextual information.

6.2 Topical Distribution

In order to better understand the specific subject matter covered by the Pile, we performed a topic modeling analysis on its components. Using Gensim (Rehurek et al., 2011), we trained 16-topic Latent Dirichlet Allocation (Blei et al., 2003) models on each component of the validation set of the Pile concurrently, in an online fashion (Hoffman et al., 2010). We filtered the Pile for English only for this analysis. Afterwards, we computed the perplexity of the Common Crawl-derived (Pile-CC) topic model on the document sets of the other components. In this way, we provide a rough measure of the degree to which parts of the Pile contain topics not well covered within Common Crawl.

In Figure 7, these cross-component perplexities are shown, with a vertical line indicating the perplexity of the Pile-CC topic model evaluated on the documents of OpenWebText2. This component was chosen as a baseline of comparison for similar reasons as in the previous evaluation: it is derived in a similar manner (filtered crawls of the open web) as the Common Crawl, and thus is expected to contain a similar distribution of topics. Although Pile-CC is somewhat diverse in its content, several of the Pile’s other components deviate from it strongly in their topical focus, as evidenced by higher perplexity on Github, PhilPapers, and EuroParl.

We also documented the topical clusters inferred from our LDA models for each component, which we provide in Appendix C. As expected, though the larger CC-derived component itself represents a diversity of content—including politics, education, sports and entertainment—the content clusters it misses become apparent when compared qualitatively to other components of the Pile. Notably, the data modes covering programming, logic, physics, and legal knowledge appear largely absent.
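A minimal sketch of the Gensim side of this analysis is shown below: train an LDA model on one component's documents and score another component's documents with log_perplexity. Tokenization and English filtering are reduced to a whitespace split here, which is an oversimplification of the actual preprocessing.

```python
# Sketch: 16-topic LDA on one Pile component, perplexity on another (Gensim).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def topic_perplexity(train_docs, eval_docs, num_topics=16):
    tokenized = [doc.lower().split() for doc in train_docs]
    dictionary = Dictionary(tokenized)
    train_bow = [dictionary.doc2bow(toks) for toks in tokenized]
    lda = LdaModel(train_bow, num_topics=num_topics, id2word=dictionary)

    eval_bow = [dictionary.doc2bow(doc.lower().split()) for doc in eval_docs]
    # log_perplexity returns a per-word likelihood bound; lower means a worse fit
    return lda.log_perplexity(eval_bow)
```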

12 Brown et al. (2020) discusses ethical issues surrounding their model, but do not discuss those surrounding the training dataset itself.

13 That said, we did exclude several datasets, see Appendix B for details.

6.3 Pejorative Content

Due to the wide diversity in origins, it is possible for the Pile to contain pejorative, sexually explicit, or otherwise objectionable content. As this content may not be desirable for some use cases, we break down profanity on a per-dataset level.

We used the profanity-checker Python package (Zhou, 2019). This package includes a “toxicity model” trained on multiple profanity lists as well as the Wikidetox Toxic Comment Dataset (Wulczyn et al., 2016) and classifies a given string as being profane or not profane.

We considered only the English sentences in each dataset using the same language classifier from Section 3.7. We did this since profanity-checker is built for English and other languages may improperly impact the results. For instance, the German nominative/accusative feminine/plural definite article “die” is flagged as being profane regardless of context. We split each sentence into words and computed the percentage of words that are flagged as profane for each component of the Pile. We emphasize that this methodology is only a proxy for profanity, given the complexity of determining whether a given word or phrase is profane in context.
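A rough sketch of this per-word measurement is shown below. It assumes the package the paper refers to as profanity-checker corresponds to the open-source profanity-check package and its predict API; if the authors used a different tool, the shape of the computation is still the same.

```python
# Sketch: fraction of words flagged as profane across a collection of sentences.
# Assumes `pip install profanity-check`; illustrative, not the paper's exact code.
from profanity_check import predict

def profane_word_fraction(sentences):
    words = [w for s in sentences for w in s.split()]
    if not words:
        return 0.0
    flags = predict(words)          # 1 if a word is flagged as profane, else 0
    return sum(flags) / len(words)

print(profane_word_fraction(["this is a perfectly polite sentence"]))
```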

As shown in Figure 8, the Pile as a whole appears less profane than Pile-CC. Further, the majority of Pile components appear less profane than Pile-CC as well.

We also broke each dataset down on a sentence level, to allow profanity-checker to check entire sentences. Splitting datasets by sentence allows for additional context to be considered when determining whether content is pejorative. Our results are shown in Figure 12.

6.4 Bias and Sentiment Co-occurrence

As language models may pick up unexpected biases from the training data, we performed a preliminary analysis of the different components that make up the Pile. Because models with different characteristics may be trained on the Pile, we aimed to document the biases of the data and not a specific model. We primarily focus on co-occurrence tests, where we analyzed what words occur in the same sentence as other specific words. Using this information, we can estimate what words strongly bias towards a category word, as well as calculate the general sentiment of surrounding words.

We focused our analysis on gender, religion, and race. Our goal is to provide users of this dataset with preliminary guidance on how the different components are biased so that they can make decisions on which components to train on.

All tables and figures in this section can be found in the Appendix.

6.4.1 Gender

We computed gender associations by computing co-occurrences for binary pronouns. For each word, we computed the difference in the rate it co-occurs with “he” and “she”14 and weighed it by the square root of its frequency. We report the top 15 most biased adjectives or adverbs (Loper and Bird, 2002) for each in Table 10. We see that words like “military”, “criminal”, and “offensive” strongly bias towards men, while “little”, “married”, “sexual”, and “happy” bias towards women.
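The co-occurrence statistic itself is straightforward to compute. Below is a hedged sketch: sentence splitting, the part-of-speech filtering mentioned above, and the exact rate normalization are simplified away, so it illustrates the shape of the computation rather than the paper's precise procedure.

```python
# Sketch: pronoun co-occurrence bias score, weighted by sqrt of word frequency.
import math
from collections import Counter

def cooccurrence_bias(sentences):
    with_he, with_she, freq = Counter(), Counter(), Counter()
    for s in sentences:
        words = s.lower().split()
        freq.update(words)
        if "he" in words:
            with_he.update(w for w in words if w not in ("he", "she"))
        if "she" in words:
            with_she.update(w for w in words if w not in ("he", "she"))
    total_he = sum(with_he.values()) or 1
    total_she = sum(with_she.values()) or 1
    scores = {}
    for w in freq:
        rate_diff = with_he[w] / total_he - with_she[w] / total_she  # >0 biases toward "he"
        scores[w] = rate_diff * math.sqrt(freq[w])
    return scores
```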

In addition, we computed the average sentiment (Baccianella et al., 2010) of words co-occurring with the gendered pronouns across each dataset in Figure 13. Generally, we find no significant sentiment bias towards men or women. This, of course, does not mean that the dataset is free of gender bias (as our co-occurrence tests show).

6.4.2 Religion

We computed a similar co-occurrence analysis for religion, which can be found in Table 11. Like gender, we find that these co-occurrences reflect how these terms are used in pockets of online discourse. For example, “radical” co-occurs with “muslim” at a high rate, while “rational” often co-occurs with “atheist”. This analysis also demonstrates some of the limitations of a purely co-occurrence based analysis. For example, “religious” often co-occurs with “atheist”, which likely reflects the type of conversations in which the word “atheist” is likely to occur as opposed to a descriptor of “atheist”.

14 We chose to only study male and female pronouns as a simplifying assumption. Studying “they” would require us to isolate its usage as a singular noun.

Figure 7: Log perplexity of 16-topic LDA trained on Pile-CC, on other Pile components. Dotted line indicates log perplexity of the topic model on OpenWebText2. Higher indicates a larger topical divergence from Pile-CC.

In addition, we computed the average sentiment of co-occurrences across each of the constituent datasets in Figure 14. Over the entire dataset, we find that “Buddhist” has the highest sentiment, followed by “Hindu”, “Christian”, “Atheist”, and “Muslim”. Notably, “Jew” is the lowest, perhaps reflecting its historical use as a pejorative.

6.4.3 Race

Finally, we ran the same analysis for racial groups. Here, as identifiers like “black” or “white” often do not indicate race, we instead compute co-occurrences with phrases like “black man” or “white woman”.

We show the top 15 most biased words for each demographic in Table 12. Once again, we found that the co-occurrences reflect the context in which these terms are used. For example, the 4 most biased words for “black” are “unarmed”, “civil”, “criminal”, and “scary”.

Similar to above, we compute the average sentiment of co-occurring words. We report the average sentiment numbers in Table 13. We find that “hispanic/latino” narrowly edges out “asian” for the highest sentiment, followed by “white”. On the other hand, “black” had the lowest sentiment, at -0.15.

We note that for all demographics, the average sentiment is negative. We hypothesize that this is due to the specific context for which the phrases we use to compute co-occurrences appear. For example, it is often quite common for news articles to describe suspects as an “asian man”.

Figure 8: Percentage of words classified as profane in the Pile. The percentage of the CC component and the weighted mean of the Pile as a whole are shown as horizontal lines.

Another issue with the use of texts in natural language processing research is consent. Although one is typically not legally obligated to receive the permission of an author to train a NLP algorithm on their work15, many consider doing so a moral obligation or a good measure to guard against misuse (Obar, 2020; Prabhu and Birhane, 2020). On the other hand, there is significant disagreement surrounding the ethics of repurposing data protected by terms of service in research contexts (Vitak et al., 2016; Fiesler et al., 2020), particularly given the power asymmetries inherent in digital platforms, which often close off independent researchers from investigating public data while simultaneously compelling users to consent to its private use (Halavais, 2019).

15 Laws vary by country. For a discussion of US law, see Section 7.1

While much of the Pile’s data comes from sources that have expressly consented to its wider dissemination and use in research, researchers often fail to clearly document where their data came from and under what terms its use was consented to. In light of this, we felt it appropriate to release the Pile with transparency around how the authors of its data have indicated that that data can be used.

To provide needed nuance to our discussion of consent, we identified three tiers of availability for public use. Public data is data which is freely and readily available on the internet. This primarily excludes data which is pay-walled (regardless of how easy that paywall is to bypass) and data which cannot be easily obtained but can be obtained, e.g. through a torrent or on the dark web. Terms of Service (ToS) compliant data is data which is obtained and used in a fashion that is known to be consistent with the terms of service of the data host. Data with authorial consent is data for which the original authors of the work consented to the use of their data, or where a reasonable person could not assume that their data would not be used for purposes such as research. ToS compliant data and authorial consented data differ in two main ways: it is important to keep in mind that people typically do not read Terms of Service, and additionally that being ToS-compliant does not entail authorial consent. We adopted a strict model of consent, where ambiguous or unknown consent is treated as non-consensual.

Table 5 summarizes our understanding of the status of each of the datasets within the Pile. Datasets marked with a ✓ are compliant in the relevant respects, though a couple datasets are worth remarking on in particular. Books3 and OpenSubtitles are being used in a fashion that is consistent with the terms of service of the data host. However, this is somewhat misleading in that the data host is not authorized to post the data online by the parties that own it. The Enron Emails dataset was not collected with the permission of the authors, but was collected by the U.S. government as part of a criminal investigation. While the people whose emails are in the Enron dataset are aware of this fact, they were not given the ability to consent to its inclusion in any way.

There are five datasets included in the Pile that were not collected and distributed in a ToS compliant fashion and for which the authors had no ability to consent to their data being used. Each of these datasets are widely used, both in the NLP literature and the world at large. With the exception of the YouTube Subtitles dataset, each of these datasets were published by researchers and are passed around freely on the internet. The YouTube Subtitles dataset was created by us for this project, using a very popular unofficial API that is both widely used and easily obtainable on Pip, Conda, and GitHub, among other places. Given the processing applied and the difficulty of identifying particular files in the Pile, we feel that our use of these datasets does not constitute significantly increased harm beyond that which has already been done by the widespread publication of these datasets.

7 Implications and Broader Impacts

The Pile represents yet another stepping stone along the path of scaling models and datasets to ever larger sizes and capabilities. There are many serious concerns about how the emergence of progressively stronger AI systems will influence the wider world (Brundage et al., 2018; Amodei et al., 2016; Bostrom and Yudkowsky, 2014; Bostrom, 2014; Critch and Krueger, 2020), and we believe that they merit serious thought. In this section we discuss the legal ramifications of the Pile, and then consider the impact of the Pile to AI alignment from two angles: accelerating AI timelines and the dangers posed by unaligned language models.

7.1 Legality of Content

While the machine learning community has begun to discuss the issue of the legality of training models on copyright data, there is little acknowledgment of the fact that the processing and distribution of data owned by others may also be a violation of copyright law. As a step in that direction, we discuss the reasons we believe that our use of copyright data is in compliance with US copyright law.16

Table 5: Types of consent for each dataset

Under pre (1984) (and affirmed in subsequent rulings such as aff (2013); Google (2015)), non-commercial, not-for-profit use of copyright media is preemptively fair use. Additionally, our use is transformative, in the sense that the original form of the data is ineffective for our purposes and our form of the data is ineffective for the purposes of the original documents. Although we use the full text of copyright works, this is not necessarily disqualifying when the full work is necessary (ful, 2003). In our case, the long-term dependencies in natural language require that the full text be used in order to produce the best results (Dai et al., 2019; Rae et al., 2019; Henighan et al., 2020; Liu et al., 2018).

Copyright law varies by country, and there may be additional restrictions on some of these works in particular jurisdictions. To enable easier compliance with local laws, the Pile reproduction code is available and can be used to exclude certain components of the Pile which are inappropriate for the user. Unfortunately, we do not have the metadata necessary to determine exactly which texts are copyrighted, and so this can only be undertaken at the component level. Thus, this should be taken to be a heuristic rather than a precise determination.

7.2 Acceleration of AI Timelines

There is serious concern that AI systems may soon be meaningfully more capable than humans in all relevant economic tasks (Grace et al., 2018; Yudkowsky, 2013). Relatedly, there are serious unresolved questions surrounding how to properly align such powerful AI systems with human interests (Bostrom and Yudkowsky, 2014; Russell, 2019; Bostrom, 2014; Amodei et al., 2016) and generally avoid morally catastrophic outcomes (Sotala and Gloor, 2017; Shulman and Bostrom, 2020). As such, it has been argued that accelerating the development of such powerful AI systems may be undesirable before these concerns have been more adequately addressed (Bostrom, 2014).

There are several pragmatic responses to this view:

  1. Due to human competition, curiosity, and cultural diversity, halting technological development is incredibly difficult, if not impossible (Russell, 2019; Critch and Krueger, 2020).

  2. AI development is experimental in nature: The alignment problem can only be solved through development, testing and (hopefully non-existential) failure.

  3. High powered language models, along with their more general successors, must be capable of viewing morally problematic content without adopting it in their output. We elaborate on this in the following section.

With this in mind, we accept the reality that the Pile could potentially accelerate AI timelines. However, we hope our efforts to establish best practices, such as thoroughly documenting the contents of our data, will help encourage diligence for downstream researchers on alignment problems.

16 This discussion does not, and is not intended to, constitute legal advice; rather, it is a general discussion of law. Only your attorney can provide assurances that the information contained herein is applicable or appropriate to a particular situation. If in doubt, it is always advisable to speak to an intellectual property attorney.

7.3 Negative LM Output

There has been much discussion about the possible negative effects of powerful language models in the world (Brown et al., 2020; Brundage et al., 2018). Some of these possible problems, such as the ability to mass produce low quality content for the purpose of Search Engine Optimization, are inherent problems to the way online content is distributed, and cannot be stopped by those developing language models alone. Directly solving these problems would require sweeping changes to the architecture of the Internet, such as vastly expanded Public Key Infrastructure and distributed authentication of identity (Ferguson and Schneier, 2003).

Another concern is that training such models on huge datasets will almost inevitably require them to have undesirable content in their training sets, such as that promoting hateful stereotypes (Christian, 2020). Having models output undesirable content is, by definition, undesirable, but we believe that attacking this problem from the training set side is unproductive and ultimately leads us away from optimal solutions. If a person reads a racist piece of content, they do not then immediately adopt its racist views—they may be capable of doing so, but can decide not to. This capacity to understand undesirable content and then decide to ignore it is an essential future research direction. Not only would this allow models to use “dirtier” data with less concern, but also to use their gained knowledge to better understand what not to do. We recognize that, despite recent progress in human-guided learning (Stiennon et al., 2020), the technology is not yet at this stage, and have thus made a number of editorial decisions as described in this paper. However, this approach seems essential to the future of these models and AI more broadly, and more research is needed.

8 Related Work

Self-supervised training of natural language processing models on large, unlabeled text corpora has seen widespread adoption in the field. Word representation models such as GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013) were trained on datasets such as Wikipedia, Gigaword (Graff et al., 2003), or a non-public Google News corpus. More recently, language models (Radford et al., 2018, 2019; Brown et al., 2020; Rosset, 2019; Shoeybi et al., 2019) and masked language models (Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2019) have been trained on datasets such as Wikipedia, BookCorpus (Zhu et al., 2015), RealNews (Zellers et al., 2019), CC-Stories (Trinh and Le, 2018), and other Internet scrape-derived datasets discussed below. Other datasets such as WikiText (Stephen et al., 2016) have also been used in similar self-supervised training.

As data requirements for language modeling have grown, the field has turned towards Internet scrapes for large-scale datasets (Gokaslan and Cohen, 2019), with Common Crawl being particularly prevalent. Works such as Brown et al. (2020); Wenzek et al. (2019); Suárez et al. (2019b); Raffel et al. (2019) have relied on Common Crawl to build training datasets for large-scale models. However, these works often highlight the difficulty of cleaning and filtering the Common Crawl data, and often highlight the resulting data quality as a determining factor of model capability.

It has also been increasingly common practice to combine multiple datasets when training language models. For instance, GPT (Radford et al., 2018) was trained on Wikipedia and BookCorpus, whereas GPT-3 (Brown et al., 2020) was trained on Wikipedia, two fiction datasets, and two web-scraped datasets. The Pile continues the trend of combining large-scale web-scrapes with smaller, higher-quality datasets that capture knowledge we believe would be most beneficial to training language models.

The two most comparable publicly available datasets to the Pile are CC-100 (Wenzek et al., 2019) and C4/mC4 (Raffel et al., 2019). C4 is comparably-sized to the Pile, while mC4 and CC-100 are larger, multilingual datasets. However, C4/mC4 require immense computational resources to preprocess the data, with its maintainers even recommending the use of a distributed cloud service,17 setting a high bar of entry to using these datasets. CC-100 is directly downloadable and pre-cleaned; however, its English portion is much smaller than the Pile. Importantly, these three datasets are all derived entirely from Common Crawl—as discussed above, the current best practice in training large-scale language models involves using both large web scrapes and more targeted, higher-quality datasets, which the Pile directly addresses.

17 https://www.tensorflow.org/datasets/

Appendices

A Contributions

All authors contributed to the design of the research project and the writing of the paper. Additionally, authors contributed as follows:

Leo Gao led the project, implemented the main Pile codebase, contributed to the model training code, performed the evaluations and the language analysis, interpreted the perplexity analysis results, implemented the processing to create the final data, and processed Pile-CC, PubMed Central, ArXiv, and Ubuntu IRC. Stella Biderman led the data analysis, the broader impact analysis, and the data documentation, and coordinated the project. She also wrote the anal- ysis of structural statistics, authorial consent, and copyright law. Sid Black implemented the model training and evaluation code and processed YouTube Subtitles, Stack Exchange, and GitHub. Laurence Golding implemented deduplication, performed the n-gram analysis, and processed OpenWebText2. Travis Hoppe processed FreeLaw, Pubmed Ab- stracts, ExPorter, and PhilPapers. Charles Foster performed the topic modeling anal- ysis, contributed to the discussion of authorial con- sent, and processed USPTO Backgrounds. Jason Phang implemented and performed the GPT- 2/3 perplexity analysis and advised the project. Horace He performed the bias and sentiment anal- ysis. Anish Thite implemented and performed the pro- fanity analysis and processed Hacker News. Noa Nabeshima processed GitHub. Shawn Presser processed BookCorpus2. Connor Leahy wrote the alignment implication analysis and the model training code.

B Excluded Datasets

In the course of building the Pile, we considered including and ultimately decided not to use several datasets. We excluded several datasets on the grounds that they were too small to be worth spending time on, or because the English component of the data did not merit inclusion on its own. However, we also decided to exclude several datasets for other reasons, which we document here for transparency:

  1. US Congressional Record. The official record of the United States Congress (1800–today) records important points of debate at the highest levels of American government. It reflects the opinions and biases of the political class over the past 200 years, including segregationism and xenophobia. In particular, we found a large quantity of extremely racist content that we did not feel appropriate for a dataset intended for general-purpose language modeling.

  2. Fanfiction. Hundreds of GiB of fanfiction has been written and put online, primarily on the websites www.fanfiction.net and https://archiveofourown.org/. This represents a significant untapped resource for language modeling as it is almost exclusively short-form fiction, a writing style that is not represented in most language modeling datasets. We ultimately decided to exclude fanfiction on logistical grounds: we found other sources of data that were easier to obtain.

  3. Literotica. Literotica is a website where users can upload short-form erotic fiction. We had originally planned on including it in the Pile and even went as far as scraping and processing it. However, we decided to not include it for several reasons. Firstly, once we decided to exclude fanfiction, Literotica represented our sole source of short-form fiction, which would likely lead to undesirable biases in the trained model. Secondly, Literotica would require significantly more investigation, assessment, and care than we spent on the other datasets. Thirdly, Literotica contains a significant amount of stereotyping, including racial fetishes. While Literotica is likely usable for some tasks, we are not comfortable including it in the Pile.


C Dataset Details

This section contains additional information about each dataset listed in Section 2, including how it was obtained, how it was processed, and any other details relevant for replication. The intent of this section is to provide as much detail as possible, so that the Pile can be replicated in the future if necessary, and so that any future processing of these and similar datasets can use or improve on our methods. As such, all code created for processing has been made publicly available under permissive open source licenses and is referenced in footnotes where applicable.

C.1 Pile-CC

We extract Common Crawl using jusText (Endrédy and Novák, 2013). Our filtering implementation uses a classifier trained against the OpenWebText2 dataset. We process only a small fraction of the available Common Crawl data; we break the list of URLs to individual WARC files from 2013 to 2020 into 3679 chunks and process 22 random chunks.
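
As a rough illustration of the chunk-sampling step above, here is a minimal sketch; the URL-list file name, the chunking scheme, and the random seed are assumptions for illustration, not the Pile's actual code.

```python
import random

def sample_warc_chunks(url_list_path, n_chunks=3679, n_sampled=22, seed=0):
    """Split a list of WARC file URLs into n_chunks pieces and sample a few of them."""
    with open(url_list_path) as f:
        urls = [line.strip() for line in f if line.strip()]

    # Split the URL list into roughly equal consecutive slices.
    chunk_size = -(-len(urls) // n_chunks)  # ceiling division
    chunks = [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]

    # Process only a small random fraction of the available Common Crawl data.
    rng = random.Random(seed)
    return rng.sample(chunks, n_sampled)
```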

C.1.1 WARC vs WET

CommonCrawl data is available in two main formats: Web ARChive (WARC) files, which contain a full record of the crawl as well as the raw HTML of the webpage, and WET files, which contain pre-extracted versions of the contents of the WARC files. The WET files have poor quality, often containing large amounts of boilerplate text like menus and page footers, but due to the lower bandwidth and computation requirements necessary to use WET files, prior work based on CC has mainly focused on using WET files while applying cleaning such as document level filtering (Brown et al., 2020; Wenzek et al., 2019), or n-sentence level deduplication with very aggressive heuristics (Raffel et al., 2019).

We do not believe that document level filtering is sufficient for WET files because many of the issues with WET files stem from intra-document boilerplate. We also find many of the heuristics used in Raffel et al. (2019), such as the removal of all lines without terminal punctuation, the removal of lines containing the word “javascript”, and 3-sentence deduplication, to be too aggressive.

C.1.2 Extraction

In addition to jusText, we also considered Trafilatura, Newspaper, Goose3, and DragNet. While we originally intended to create an extraction benchmark, this proved infeasible given our available resources, and we chose jusText based on visual inspection of the output. In this inspection, we noticed that jusText tends to discard more data than many other extractors, which is not a major drawback given the large volume of CC data available. This was as expected, given jusText’s intended application to text corpora creation. In contrast, Trafilatura is, for instance, better at preserving the structure of the website faithfully, often correctly extracting elements such as tables, but it kept too much unnecessary boilerplate. Had we used Trafilatura, we would have required an additional intra-page filtering step to remove boilerplate from the page.

C.1.3 Languages

While jusText does technically support several other languages, the quality on those languages is worse than on English as many constants in the algorithm are specifically tuned for English. Additionally, jusText is completely unable to handle languages such as Chinese and Japanese, which do not use spaces to delimit words.

Due to the difficulty of maintaining an acceptable level of extraction quality across all languages, we decided to restrict the scope of the CC dataset to only English and leave a high-quality, fully multilingual, WARC-based CC dataset to future work. To filter for only English, we use the pycld2 library and only attempt to extract text from documents where English is the most common language.

We use pycld2 instead of fasttext because it is capable of classifying the language from the HTML directly, and because jusText requires knowledge of the language of the webpage before extraction. Additionally, pycld2 was significantly faster than jusText, and by only processing with jusText documents classified as English by pycld2, we reduced the required computation by approximately half.
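
A minimal sketch of this language-then-extraction pipeline, assuming the pycld2 and jusText Python packages; the function below is illustrative, not the Pile's code.

```python
import justext
import pycld2 as cld2

def extract_english_text(html):
    """Language-filter with pycld2 on the raw HTML, then extract content with jusText."""
    is_reliable, _, details = cld2.detect(html, isPlainText=False)
    if not is_reliable or details[0][1] != "en":
        return None  # skip documents whose most common language is not English

    paragraphs = justext.justext(html.encode("utf-8"), justext.get_stoplist("English"))
    # Keep only paragraphs jusText does not mark as boilerplate.
    return "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)
```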

Extracting text from websites for language modeling, especially for multilingual corpora, is highly nontrivial, and we leave the refinement of such extraction to future work.

C.1.4 Filtering

To filter CC for quality, we follow Brown et al. (2020) in training a classifier to classify between a known high quality dataset and CC. We use fasttext with an n-gram size of 2. We ran experiments using both the entire Pile and just OpenWebText2 as the positive examples, with score distributions on unseen CC data as shown in Figure 9. We decided to use only OpenWebText2 for positive examples for our final CC data because of the low sensitivity of the resulting filtering ratio to the filtering parameter α.
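
A minimal sketch of such a quality classifier, assuming a fastText training file with hypothetical `__label__owt2` / `__label__cc` labels; the Pareto-style keep rule is the one described by Brown et al. (2020) and is included only to illustrate how the score might be used, not as the Pile's exact procedure.

```python
import fasttext
import numpy as np

# Train a binary classifier with word bigrams, as described above.
# The training file name and label names are illustrative assumptions.
model = fasttext.train_supervised(input="owt2_vs_cc.train.txt", wordNgrams=2)

def owt2_score(document):
    """Classifier score that a CC document resembles OpenWebText2."""
    labels, probs = model.predict(document.replace("\n", " "))
    return probs[0] if labels[0] == "__label__owt2" else 1.0 - probs[0]

def keep_document(document, alpha=9):
    """Stochastic keep rule in the style of Brown et al. (2020)."""
    return np.random.pareto(alpha) > 1.0 - owt2_score(document)
```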

C.2 PubMed Central

We use pandoc 1.19.2.4 (MacFarlane, 2006–2020) to convert the JATS format data provided by PMC to markdown. Afterwards, we remove any line beginning with :::, which is used by pandoc to indicate html classes in markdown.
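
The post-processing step can be sketched as follows, assuming the JATS-to-Markdown conversion has already been performed with pandoc; this is an illustration, not the Pile's code.

```python
def strip_pandoc_div_markers(markdown_text):
    """Drop lines beginning with ':::', which pandoc emits to mark HTML classes."""
    return "\n".join(
        line for line in markdown_text.splitlines()
        if not line.startswith(":::")
    )
```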

C.3 Books3

No additional details.

C.4 OpenWebText2

To produce the dataset, URLs and their associated metadata were first extracted from all Reddit submissions up to April 2020. URLs were deduplicated, with each unique URL featuring a list of associated submission metadata and an aggregate score. URLs with an aggregate score of less than 3 were removed. The links were then scraped and processed with the Newspaper scraper. Deduplication was performed at the document level using in-memory MinHashLSH through the DataSketch library.

Both filtered and raw versions were produced, with the raw version only deduplicated by URL. The filtered version contains 65.86 GB of uncompressed text across 17,103,059 documents. The raw version is much larger, at 193.89 GB of uncompressed text across 69,547,149 documents.
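
The URL-level score filtering described above might look like the following sketch, where `submissions` is assumed to be an iterable of (url, score) pairs; this is illustrative, not the OpenWebText2 pipeline itself.

```python
from collections import defaultdict

def filter_urls_by_score(submissions, min_score=3):
    """Aggregate Reddit submission scores per unique URL and keep URLs scoring >= min_score."""
    totals = defaultdict(int)
    for url, score in submissions:
        totals[url] += score
    return {url for url, total in totals.items() if total >= min_score}
```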

C.4.1 Extractor Choice

We chose to use Newspaper instead of jusText for OpenWebText2 for consistency with OpenWebTextCorpus. Additionally, by using multiple different HTML extractors for different components of the Pile, we reduce the potential impact of systematic biases from any one extractor negatively impacting the dataset.

C.5 ArXiv

We downloaded the TeX sources of all papers up to the July 2020 dump (the last file included in our data is arXiv_src_2007_068.tar) via arXiv’s S3 Bulk Source File Access18, and used pandoc 1.19.2.4 to convert these source files to Markdown, discarding any papers which had errors during the conversion process. This yielded a total of 1,264,405 papers.

We remove any line beginning with :::, which is used by pandoc to indicate html classes in markdown.

C.6 GitHub

We separate the data gathering process into two steps:

  1. Gathering a list of the desired repositories and their metadata

  2. Extracting all text data useful for language modeling from each repository

For the first step, mirroring the approach of the WebText dataset, we use GitHub ‘stars’ as a proxy for quality, and choose to gather only repositories with more than 100 stars. For practical reasons, we also limit the list of repositories gathered to repositories with less than 1 GB of files. Since GitHub’s API limits the number of search results to 1000, in order to comprehensively gather all repositories we need to create many small queries that each return fewer than 1000 results, in such a way that every repository of interest will be returned by at least one of our queries. To achieve this, we bound our initial search by size to return only repositories between a lower bound of 0 and 5 bytes. At the time of writing, this returns 965 results. For the next step, we set our lower bound one above our previous upper bound, and decide on a new upper bound that should also return fewer than 1000 results, using the results from our last query to estimate the new upper bound as lowerbound + 1000/(n/r), where n is the number of results from the previous query and r is the range of size bounds in the previous step.

This tends not to overshoot, because GitHub repositories follow a power distribution with respect to size, but if it does, we simply use the number of repositories our new query returned in order to construct a new upper bound estimate.

Using the gathered list of repositories, we clone each one, extract any text-based files, and discard the rest. Because some repositories took an impractical amount of time to clone and/or extract, we set a hard time limit of 300 seconds for both the git cloning and text extraction steps. As such, some larger repositories may only be partially extracted. We also impose a file size limit of 100 kB on extracted files, as we found that the majority of files over that size were typically very repetitive auto-generated source files or data files, and that setting this file size limit was an effective cleaning step to limit the data to code.
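
A minimal sketch of the bound-update rule just described; the function and variable names are ours, not the Pile's.

```python
def next_size_bounds(prev_lower, prev_upper, prev_result_count, max_results=1000):
    """Estimate the next repository-size query range (in bytes).

    The new lower bound is one above the previous upper bound, and the new upper
    bound is chosen so the query is expected to return fewer than max_results
    repositories, assuming the result density of the previous range carries over.
    """
    prev_range = prev_upper - prev_lower          # r: width of the previous range
    density = prev_result_count / prev_range      # n / r: results per byte of range
    new_lower = prev_upper + 1
    new_upper = new_lower + int(max_results / density)  # lowerbound + 1000 / (n / r)
    return new_lower, new_upper
```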

18 https://arxiv.org/help/bulk_data_s3

Figure 9: Score distribution of documents from Common Crawl given different classifier training data. (a) OpenWebText2; (b) full Pile.

Because we wanted to limit the size of the overall Pile, we randomly sampled 95.0 GiB of the 630.64 GiB of Github data we collected in total and leave quality filtering to future work.

However, we believe code generation will be an increasingly important component of language models as they continue to scale up and increase in their ability to generalize. As such, we hope to extend this dataset in future work.

C.7 FreeLaw

We download the court opinions data in bulk from CourtListener,19 and extract the raw text using BeautifulSoup.

C.8 Stack Exchange

To construct the dataset, we download and parse every Stack Exchange database dump to plaintext files. We opt to extract the top three answers with at least three upvotes, discarding all other responses. We only include the plain text question and response and do not incorporate any metadata. Motivated by large-scale language models’ few-shot ability (Brown et al., 2020), we provide context by prepending all questions and answers with Q:\n\n and A:\n\n respectively.

The resulting dataset contains a total of 15,622,475 documents across a total of 365 Stack Exchanges and Meta-Stack Exchanges, the bulk of which is from StackOverflow.
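
The answer selection and Q:/A: formatting described above might look like the following sketch, where `answers` is assumed to be a list of (upvotes, text) pairs; it is an illustration, not the Pile's parser.

```python
def format_stackexchange_document(question, answers, min_upvotes=3, top_k=3):
    """Build one plain-text training document from a question and its answers.

    Only the top `top_k` answers with at least `min_upvotes` upvotes are kept,
    and Q:/A: markers are prepended as described above.
    """
    ranked = sorted((a for a in answers if a[0] >= min_upvotes),
                    key=lambda a: a[0], reverse=True)
    parts = ["Q:\n\n" + question]
    parts += ["A:\n\n" + text for _, text in ranked[:top_k]]
    return "\n\n".join(parts)
```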

C.9 USPTO Backgrounds

The United States Patent and Trademark Office (USPTO) has published bulk archives of the full text of all patents granted in the US from 1976 to September 2020. From these archives, we extract the Background sections, along with key grant-specific metadata, such as the inventor, assignee, and classification information.

19 https://www.courtlistener.com/api/

Figure 10: Left: number of new submissions per year to arXiv, grouped by domain over time. Right: fractional submission rates for each of the domains. Figure from https://arxiv.org/help/stats/2019_by_area/




The file format used for storing the bulk text of US patents has changed over time. Prior to 2002, all of the datasets are in a specialized format called APS (Automated Patent System). Since 2002, the data is XML encoded. Partially as a function of this change, the location of the “Background” section has also shifted. Our converter accounts for these structural shifts and extracts the raw text from each patent’s Background.

C.10 PubMed Abstracts

About one-third of the articles in the dataset were missing or contained a malformed title or abstract and were excluded. Additionally, PubMed Central (see Section 2.2) contains full-text resources to many recent publications; any publications which already appear in PMC are excluded from this set. To process the data, we concatenated the title and abstract and removed any copyright information. The remaining dataset contains 15,518,009 titles and abstracts.

C.11 Project Gutenberg

No additional details.

C.12 OpenSubtitles

To create the text dataset, we simply extract the subtitle text from each XML file in the English language dataset provided by Tiedemann (2016), discarding any provided metadata.

C.13 Wikipedia (English)

We use the wikipedia/20200301.en dataset from TensorFlow Datasets.20 We prepend the title to the body of each article, separated by two newlines.

C.14 DeepMind Mathematics

We include instances from the Easy, Medium, and Hard components of DeepMind Mathematics, breaking each curriculum item (such as algebra__polynomial_roots) into 8 KiB chunks.
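
A minimal sketch of the 8 KiB chunking, measuring chunk sizes in UTF-8 bytes (an assumption; the paper does not specify byte- versus character-level chunking):

```python
def chunk_text(text, chunk_size=8 * 1024):
    """Split one curriculum file into chunks of up to 8 KiB."""
    data = text.encode("utf-8")
    return [data[i:i + chunk_size].decode("utf-8", errors="ignore")
            for i in range(0, len(data), chunk_size)]
```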

20 https://www.tensorflow.org/datasets/catalog/wikipedia#wikipedia20200301en

C.15 Ubuntu IRC

We processed all logs from July 5, 2004 through September 1, 2020.

To process the data, all system messages, such as joins, disconnects, nick changes, etc. were discarded, but actions (i.e., using /me) were kept. Timestamps were removed, and all logs for the same channel in a given week were concatenated into a single document, with the logs for each day prepended with the date if that day’s log is non-empty.

C.16 BookCorpus2

The original BookCorpus consists of 11,038 books. However, due to issues with availability of the original BookCorpus, as well as the possibility of collecting a larger version, we decided to collect our own version of BookCorpus using a similar methodology as Kobayashi (2018). Our version of BookCorpus contains 17,868 books instead.

We create and use a modified version of the epub-to-text converter in Kobayashi (2018) that:

• Correctly preserves the document structure across chapters, matching the table of contents very closely;

• Correctly renders tables of data, whereas by default html2txt produces poor-quality results for tables;

• Correctly preserves code structure, so that source code is visually coherent;

• Converts numbered lists from “1.” to “1.”;

• Runs the full text through ftfy.fix_text() (Speer, 2019), replacing Unicode apostrophes with ASCII apostrophes and expanding Unicode ellipses to “...” (three separate ASCII characters).

C.17 EuroParl

We download the data in bulk from the Europarl website.21 We remove all basic tag information and only retain the name of each document as a title: for example, a speaker tag with NAME="Pronk" simply becomes Pronk. We then extract the body of each document, discarding those that are shorter than 200 characters.

21 http://www.statmt.org/europarl/

C.18 HackerNews


We first use the Hackernews BigQuery dataset to obtain a list of all story ids in our date range. For the Pile we use the first Hacker News post (1) to post number 24531712. This corresponds to a date range of approximately 10/09/2006 to 09/20/2020. We use the BigQuery dataset to gather story ids for efficiency purposes. However, the BigQuery dataset was lacking some information for stories, so we used the official Hacker News API for story and comment text retrieval.

Hacker News displays and stores comments in a tree-like manner, with children comments replying to parent comments. However, most language models require input data to be in a sequential form. Considering each path through the comment tree as a sequence could be detrimental, since there would be a large amount of near-duplicate comment sequences. In addition, only taking one path through the comment tree for each story leaves out a large portion of the comment data. Therefore, we parsed comments in a hybrid form. For every top-level comment (comments that have no parent comment), we create a sequence of comments by traversing down the comment tree from the top-level comment. We choose the next comment by taking the child comment with the highest number of children comments (a cheap attempt at taking a long path through the comment tree; note that it does not take the longest possible path).
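
The hybrid traversal can be sketched as follows, assuming comments are represented as dicts with hypothetical "text" and "children" keys (an illustrative structure, not the Hacker News API schema):

```python
def subtree_size(comment):
    """Total number of comments in the subtree rooted at `comment`."""
    return 1 + sum(subtree_size(child) for child in comment.get("children", []))

def thread_from_top_level(top_level_comment):
    """Build one comment sequence per top-level comment.

    At each step we descend into the child with the most descendants - a cheap
    heuristic for a long (not necessarily longest) path through the tree.
    """
    sequence = []
    node = top_level_comment
    while node is not None:
        sequence.append(node["text"])
        children = node.get("children", [])
        node = max(children, key=subtree_size) if children else None
    return sequence
```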

We consider all stories that have at least one com- ment and are not flagged by the moderators for potential conduct violations. Since comments are stored in HTML, we use the html2text package to extract the text from the post.

We order each document by listing the title, url, sub-title, and author at the top. Top-level comments are delimited by “\n----\n” and sub-comment chains are delimited by “\n~~~\n”. We include the author and extracted text for each comment.

C.19 YouTube Subtitles

We construct the dataset in three stages:

  1. We build a large list of search terms by prompting a GPT-3 model with a manually selected list of queries, manually filtering the responses, and repeating this process iteratively until a suitable size is reached. The list of terms is centred around, but not limited to, educational topics.

  2. We use requests-html to gather a list of 1000 Youtube video IDs for each search term, and deduplicate the resulting video ids across search terms.

  3. We use YoutubeTranscriptApi22 to gather all human generated closed captions for every available language for each video. To align each language in parallel, we split the captions for each language into parallel minute-long sections by timestamp, and arrange each language in a random order within these sections, appending the language as a header to each minute-long section to provide context. If only a single language is available, the output is just the subtitles, with no header appended.

In total, subtitles for 173,651 videos were gathered.

C.20 PhilPapers

PhilPapers (PP) is indexed using OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting. As such, the first step to collect the data is to get the XML for all links. This was done using pyoaiharvester.23

From that, each publication is downloaded. Some entries do not exist, or have been removed by the authors. Papers with text are extracted using pdfbox, and papers with non-machine-readable text are ignored. Non-English language publications are kept, and the metadata reflects the language reported by the OAI-PMH XML. The text is filtered with pdf_filter.py from PDFextract, and we discard any papers with less than 1000 characters.24

C.21 NIH Grant abstracts: ExPORTER

The NIH provides a bulk-data repository for awarded applications through the ExPORTER service covering the fiscal years 1985–present. These data come from the NIH, but also other Health and Human Services agencies (ACF, AHRQ, CDC, HRSA, FDA), and the VA. Additionally, the NIH provides a legacy data format named CRISP for awarded applications during the fiscal years 1970–2009.

22 https://github.com/jdepoix/youtube-transcript-api

23 https://github.com/vphill/pyoaiharvester/

24 https://github.com/sdtblck/PDFextract

We merged both the ExPORTER and CRISP data to form a consolidated dataset of awarded applications. Entries were deduplicated based on their application ID, and excluded if their abstract text was missing or too short. Small grants, especially administrative ones, consisted solely of short boilerplate. For this reason, we further deduplicated on abstract text. All grant types were considered, including new applications (Application Type Code 1) and renewals (Application Type Code 2), as the text differed enough to provide novel input. The text was then minimally parsed to remove administrative boilerplate (e.g., most old awards contain some variation of “description: (provided by applicant)”). In total, 939,668 grant application abstracts were added.
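
The boilerplate stripping might look like the following sketch; the exact regular expression is an assumption based on the example phrase above, not the Pile's own pattern.

```python
import re

# Targets the administrative prefix many awards share, e.g.
# "DESCRIPTION (provided by applicant): ..." in various spellings.
_BOILERPLATE = re.compile(
    r"^\s*description\s*[:\-]?\s*\(provided by (the )?applicant\)\s*[:\-]?\s*",
    re.IGNORECASE,
)

def clean_abstract(text):
    """Remove the leading 'description: (provided by applicant)' boilerplate."""
    return _BOILERPLATE.sub("", text).strip()
```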

C.22 Enron Emails

To extract the data, we used the mailparser package25 to extract the body of each email as a document.

D General Data Processing

This section discusses any processes applied across multiple datasets.

To combine the constituent datasets, we iterate until the size of the output dataset is the desired size, drawing documents from datasets at random, weighted by the number of documents in each dataset times the number of epochs desired on that dataset. Because the number of documents involved is high, by the law of large numbers, the number of copies of each dataset present in the Pile is approximately equal to its epoch count.
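
A minimal sketch of this weighted mixing, assuming `datasets` maps a name to a (documents, epochs) pair; it samples with replacement for brevity, which by the law of large numbers still approximates the target epoch counts.

```python
import random

def mix_datasets(datasets, target_docs, seed=0):
    """Draw documents at random, weighted by document count times desired epochs."""
    rng = random.Random(seed)
    names = list(datasets)
    weights = [len(datasets[n][0]) * datasets[n][1] for n in names]
    out = []
    for _ in range(target_docs):
        name = rng.choices(names, weights=weights, k=1)[0]
        out.append(rng.choice(datasets[name][0]))
    return out
```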

Shuffling a dataset posed a major problem due to our limited memory and computational budget. We follow Hardin (2018), a method descended from Rao (1961), and interleave our output to produce 30 output piles.

We hold out approximately 10 GiB of data from the Pile, of which 2 GiB are used to create the validation and test splits, and the remainder is held in reserve. From the training set, we remove any elements that are also present verbatim in any of the held out data, to prevent leakage.

D.1 Weights

Similar to Brown et al. (2020), we increase the weight of certain components such that the number of epochs elapsed on data we consider high quality is greater than one. Our choice of weights was primarily informed by the source of the data and the size of the dataset; we attempted to upweight academic texts the most, which we felt provided the highest quality data, as well as smaller sets, such that they would have a more pronounced impact on the data. We strictly disallowed giving any dataset more than 3 epochs and generally avoided giving any dataset more than 2 epochs.

D.2 Deduplication

Due to memory constraints we did not perform Pile wide de-duplication. Instead, de-duplication was performed at the document level within OpenWebText2 and Pile-CC as those sets were the most likely to contain duplicate documents.

The same technique was used for both OpenWebText2 and Common Crawl—MinHashLSH with the Python Datasketch library.26 We used 10 hash functions for each Minhash and an approximate Jaccard similarity of 0.5. This produced a duplicate rate of 28% in OpenWebText2 and 26% for Common Crawl.
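
A minimal sketch of this document-level deduplication with the Datasketch library, using the parameters above (10 hash functions, a 0.5 Jaccard threshold); whitespace tokenisation is an assumption.

```python
from datasketch import MinHash, MinHashLSH

def deduplicate(documents, num_perm=10, threshold=0.5):
    """Drop documents that are near-duplicates of an earlier document."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, doc in enumerate(documents):
        m = MinHash(num_perm=num_perm)
        for token in doc.split():
            m.update(token.encode("utf-8"))
        if lsh.query(m):          # a similar document was already kept
            continue
        lsh.insert(str(i), m)
        kept.append(doc)
    return kept
```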

The main challenge here was computational, leading us on a journey through the various LSH persistence options. A simple quadratic MinHash comparison of all documents would have taken several hundred thousand years, motivating the use of LSH. Initially, we did not have sufficient RAM for in-memory LSH and chose to use the Cassandra backend when de-duplicating OpenWebText2. This was reasonably fast, but the same method resulted in a corrupted database about 3/4 of the way through processing Common Crawl. After the Cassandra corruption, we briefly tested the experimental Mongo implementation; however, this was quite slow due to the nature of Mongo itself. In the end, we ran in-memory LSH on a machine with enough RAM for Common Crawl, taking several days.

25 https://github.com/SpamScope/mail-parser

26 https://github.com/ekzhu/datasketch

D.3 Downstream Validation Leakage


To avoid leakage of data from downstream evaluations, recent work (Radford et al., 2019; Brown et al., 2020; Shoeybi et al., 2019) has removed any data in the training set that may overlap with the evaluation metrics. We decided not to perform any such removal, because it is impossible to anticipate all potential downstream evaluation metrics, and so any particular selection of metrics would inevitably either become obsolete as the choice of benchmarks in the field changes, or potentially hinder the development of new benchmarks for models trained on Pile.

For models trained on Pile and evaluated on metrics other than Pile’s own validation and test sets, we encourage authors to remove overlaps between Pile and the validation data of these additional downstream evaluations. We do not anticipate that such leakage removal will hurt model performance, as the validation sets of most benchmarks are very small in relation to the size of the Pile, and so choosing to evaluate on more metrics will not be a disadvantage for any model.

E Investigating data

E.1 13-Gram Analysis

As part of our exploratory analysis, we calculated the counts of all 13-grams across Common Crawl. We chose n = 13 due to its use in prior work (Brown et al., 2020). There were a total of 40,216,231,078 different 13-grams in this dataset. The 1000 most common range from 11 million occurrences down to 20k.
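
The counting itself reduces to a sliding window over word tokens; a simple in-memory sketch is shown below (the real analysis over all of Common Crawl would need a disk-backed or distributed counter).

```python
from collections import Counter

def count_ngrams(documents, n=13):
    """Count word-level n-grams (n = 13 following Brown et al., 2020)."""
    counts = Counter()
    for doc in documents:
        words = doc.split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts
```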

The most frequently occurring 13-grams were character repetitions used for styling, such as “– –”, “* * * *”, and “! !”, at 11 million, 5.8 million and 1.1 million occurrences respectively. Other characters used in this manner include the following: “# . > ?”. In the 264k count range, we see repetitions of badly formatted HTML escape characters “; &nbsp”, “; amp”. Boilerplate from standard forum software appears around the 180k occurrences range, such as the following: “select the forum that you want to visit from the selection below”.

Overall, a large amount of common HTML and CSS is included in the top 1000, along with boilerplate text from Amazon Affiliate Advertising, TripAdvisor, SimplyHired, Associated Press, PostMedia, the FCC, etc. PHP error messages and password login prompts also made an appearance. It may be of interest to fans of Portal that repetitions of “the cake is a lie .” achieved a high count.

Table 7: Tokens per byte for Pile components

E.2 Benchmark Perplexity Computation


To compute the perplexity for a given dataset, we tokenize each document separately, divide the document into segments of up to the maximum sequence length of the model (1024 tokens for GPT-2, 2048 for GPT-3), and predict the logits of each segment. The inputs to the model are the immediately prior tokens, e.g. for scoring tokens 1 to 1024, we provide tokens 0 to 1023 as the input context. The respective language model implementations handle the causal attention masking. This ensures that every token in the dataset is scored exactly once. This also means that some tokens will have more input context than others. We then aggregate over the whole dataset and compute the final perplexity score. The perplexity for the whole Pile is computed by aggregating over the constituent datasets (i.e. weighted by dataset size, not a simple average of dataset perplexities). Both GPT-2 and GPT-3 share the same tokenizer and vocabulary, making the perplexity scores directly comparable. We use the Hugging Face (Wolf et al., 2020) implementation of GPT-2, and the OpenAI API for GPT-3. The davinci model in the OpenAI API is presumed to correspond to a 175B parameter version of GPT-3.
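
A minimal sketch of this scheme for GPT-2 with the Hugging Face implementation (model size, batching, and device placement omitted; this is an illustration, not the paper's evaluation code):

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def dataset_perplexity(documents, max_len=1024):
    """Tokenise each document separately, score it in segments of up to max_len
    tokens, and aggregate negative log-likelihood over all scored tokens."""
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for doc in documents:
            ids = tokenizer(doc, return_tensors="pt").input_ids[0]
            for start in range(0, len(ids) - 1, max_len):
                segment = ids[start:start + max_len + 1].unsqueeze(0)
                # labels == inputs: the model shifts by one position internally,
                # so each token is predicted from the tokens before it.
                out = model(segment, labels=segment)
                n = segment.size(1) - 1          # number of scored tokens
                total_nll += out.loss.item() * n
                total_tokens += n
    return math.exp(total_nll / total_tokens)
```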

In Table 8 we show the test set perplexities (i.e. not normalized by UTF-8 length, as in Table 2). Because of the costs associated with using the OpenAI API, we compute test perplexities on only one-tenth of the test set in Table 8 and Table 2. Specifically, we randomly sample one-tenth of the documents of each dataset except for three: Ubuntu IRC, BookCorpus2, and PhilPapers. In Table 9, we show test perplexity computed on the full test set on all GPT-2 models.

Figure 11: Test loss (log perplexity) over the Pile, bucketed by position in the input sequence based on the model’s maximum sequence length. To smooth out the lines, we bucket 4 positions per plotted datapoint (e.g. positions 0–3, positions 2044–2047). Later tokens are predicted with more context and thus see lower perplexities.

E.3 Pejorative Content

Initially we decided on separating pejorative content into 4 groups: sex-related terminology, slurs, neither of these categories, and both of these categories. We adapted a public “naughty words” list and broke the words into these categories with the intent of looking at the proportion of each category in each dataset. However, this approach presented many issues.

First, any blacklist of words would be hard-pressed to catch all instances of pejorative content, since purposeful misspellings of words can evade the censor and still have the intended effect. Furthermore, words and their intents are always evolving, so any list created would likely always be outdated. Another issue pertains to sorting the words into categories: words are highly dependent on their context, so a word could change categories in different contexts.

Figure 12: Percentage of sentences classified as profane in the Pile. The percentage of the CC component and the weighted mean of the Pile as a whole are shown as horizontal lines.

F Data Samples

The following consists of two random, non-cherrypicked 512-byte samples from each constituent dataset of the Pile, sampled from the validation split.

F.1 Pile-CC

pot trending topics and the coverage around them. First up, there’s a bit of a visual redesign. Previously, clicking on a trending topic would highlight a story from one publication, and you’d have to scroll down past a live video section to view related stories. Facebook is replacing that system with a simple carousel, which does a better job of showing you different coverage options. To be clear, the change doesn’t affect how stories are sourced, according to Facebook. It’s still the same algorithm pickine public safety. He said the bridge saves commuters two or three minutes when trains pass – and those minutes could be vital.

“Two to three minutes may not mean much if you’re just driving home from work, but if you’re the one waiting for an ambulance to get to your home, if you’re the one waiting for a fire truck to get to your home, if you’re the one waiting for a police car to get to your home, those two to three minutes could mean the difference between life or death,” Sharp said. “That’s what this pro

F.2 PubMed Central

…(Omitted)…

For anything beyond this summary, refer to the paper's appendix directly; it contains further details on the preprocessing and processing methods for code, mathematics, and more.
