[Decontamination: key index marking]
Contents
Component | Raw Size | Weight | Epochs | Effective Size | Mean Document Size |
---|---|---|---|---|---|
Pile-CC | 227.12 GiB | 18.11% | 1.0 | 227.12 GiB | 4.33 KiB |
PubMed Central | 90.27 GiB | 14.40% | 2.0 | 180.55 GiB | 30.55 KiB |
Books3† | 100.96 GiB | 12.07% | 1.5 | 151.44 GiB | 538.36 KiB |
OpenWebText2 | 62.77 GiB | 10.01% | 2.0 | 125.54 GiB | 3.85 KiB |
ArXiv | 56.21 GiB | 8.96% | 2.0 | 112.42 GiB | 46.61 KiB |
Github | 95.16 GiB | 7.59% | 1.0 | 95.16 GiB | 5.25 KiB |
FreeLaw | 51.15 GiB | 6.12% | 1.5 | 76.73 GiB | 15.06 KiB |
StackExchange | 32.20 GiB | 5.13% | 2.0 | 64.39 GiB | 2.16 KiB |
USPTO Backgrounds | 22.90 GiB | 3.65% | 2.0 | 45.81 GiB | 4.08 KiB |
PubMed Abstracts | 19.26 GiB | 3.07% | 2.0 | 38.53 GiB | 1.30 KiB |
Gutenberg (PG-19)† | 10.88 GiB | 2.17% | 2.5 | 27.19 GiB | 398.73 KiB |
OpenSubtitles† | 12.98 GiB | 1.55% | 1.5 | 19.47 GiB | 30.48 KiB |
Wikipedia (en)† | 6.38 GiB | 1.53% | 3.0 | 19.13 GiB | 1.11 KiB |
DM Mathematics† | 7.75 GiB | 1.24% | 2.0 | 15.49 GiB | 8.00 KiB |
Ubuntu IRC | 5.52 GiB | 0.88% | 2.0 | 11.03 GiB | 545.48 KiB |
BookCorpus2 | 6.30 GiB | 0.75% | 1.5 | 9.45 GiB | 369.87 KiB |
EuroParl† | 4.59 GiB | 0.73% | 2.0 | 9.17 GiB | 68.87 KiB |
HackerNews | 3.90 GiB | 0.62% | 2.0 | 7.80 GiB | 4.92 KiB |
Youtube Subtitles | 3.73 GiB | 0.60% | 2.0 | 7.47 GiB | 22.55 KiB |
PhilPapers | 2.38 GiB | 0.38% | 2.0 | 4.76 GiB | 73.37 KiB |
NIH ExPorter | 1.89 GiB | 0.30% | 2.0 | 3.79 GiB | 2.11 KiB |
Enron Emails† | 0.88 GiB | 0.14% | 2.0 | 1.76 GiB | 1.78 KiB |
The Pile | 825.18 GiB | | | 1254.20 GiB | 5.91 KiB |
Introduction
Recent breakthroughs in general-purpose language modeling have demonstrated the effectiveness of training massive models on large text corpora for downstream applications (Radford et al., 2019; Shoeybi et al., 2019; Raffel et al., 2019; Rosset, 2019; Brown et al., 2020; Lepikhin et al., 2020). As the field continues to scale up language model training, the demand for high-quality massive text data will continue to grow (Kaplan et al., 2020).
The growing need for data in language modeling has caused most existing large-scale language models to turn to the Common Crawl for most or all of their data (Brown et al., 2020; Raffel et al., 2019). While training on the Common Crawl has been effective, recent work has shown that dataset diversity leads to better downstream generalization capability (Rosset, 2019). Additionally, large-scale language models have been shown to effectively acquire knowledge in a novel domain with only relatively small amounts of training data from that domain (Rosset, 2019; Brown et al., 2020; Carlini et al., 2020). These results suggest that by mixing together a large number of smaller, high quality, diverse datasets, we can improve the general cross-domain knowledge and downstream generalization capabilities of the model compared to models trained on only a handful of data sources.
To address this need, we introduce the Pile: an 825.18 GiB English text dataset designed for training large-scale language models. The Pile is composed of 22 diverse and high-quality datasets, including both established natural language processing datasets and several newly introduced ones. In addition to its utility in training large language models, the Pile can also serve as a broad-coverage benchmark for cross-domain knowledge and generalization ability of language models.
We introduce new datasets derived from the following sources: PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US Patent and Trademark Office, PubMed, Ubuntu IRC, HackerNews, YouTube, PhilPapers, and NIH ExPorter. We also introduce OpenWebText2 and BookCorpus2, which are extensions of the original OpenWebText (Gokaslan and Cohen, 2019) and BookCorpus (Zhu et al., 2015; Kobayashi, 2018) datasets, respectively.
In addition, we incorporate several existing high-quality datasets: Books3 (Presser, 2020), Project Gutenberg (PG-19) (Rae et al., 2019), OpenSubtitles (Tiedemann, 2016), English Wikipedia, DM Mathematics (Saxton et al., 2019), EuroParl (Koehn, 2005), and the Enron Emails corpus (Klimt and Yang, 2004). To supplement these, we also introduce a new filtered subset of Common Crawl, Pile-CC, with improved extraction quality.
Figure 1: Treemap of Pile components by effective size.
The core contributions of this paper are:
Through our analyses, we confirm that the Pile is significantly distinct from pure Common Crawl data. Additionally, our evaluations show that the existing GPT-2 and GPT-3 models perform poorly on many components of the Pile, and that models trained on the Pile significantly outperform both raw and filtered Common Crawl models. To complement the performance evaluations, we also perform an exploratory analysis of the text within the Pile to provide a detailed picture of the data. We hope that our extensive documentation of the construction and characteristics of the Pile will help researchers make informed decisions about potential downstream applications.
Finally, we make publicly available the preprocessing code for the constituent datasets of the Pile and the code for constructing alternative versions. In the interest of reproducibility, we also document all processing performed on each dataset (and the Pile as a whole) in as much detail as possible. For further details about the processing of each dataset, see Section 2 and Appendix C.
The Pile is composed of 22 constituent sub-datasets, as shown in Table 1. Following Brown et al. (2020), we increase the weights of higher quality components, with certain high-quality datasets such as Wikipedia being seen up to 3 times (“epochs”) for each full epoch over the Pile. Detailed information about the construction of each dataset is available in Appendix C.
Table 1: Overview of datasets in the Pile before creating the held out sets. Raw Size is the size before any up- or down-sampling. Weight is the percentage of bytes in the final dataset occupied by each dataset. Epochs is the number of passes over each constituent dataset during a full epoch over the Pile. Effective Size is the approximate number of bytes in the Pile occupied by each dataset. Datasets marked with a † are used with minimal preprocessing from prior work.
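The relationship between the columns of Table 1 can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that Effective Size is Raw Size multiplied by Epochs and that Weight is the share of the total effective size; the function names are illustrative and not taken from the Pile codebase. Small differences in the last digit relative to the table are expected, since the published raw sizes are already rounded.

```python
# Raw sizes (GiB) and epochs copied from Table 1 for a few components.
COMPONENTS = {
    "Pile-CC": (227.12, 1.0),
    "PubMed Central": (90.27, 2.0),
    "Wikipedia (en)": (6.38, 3.0),
}

# Total effective size of the Pile in GiB, from the last row of Table 1.
PILE_EFFECTIVE_GIB = 1254.20


def effective_size(raw_gib, epochs):
    """Approximate bytes (in GiB) a component occupies in one Pile epoch."""
    return raw_gib * epochs


def weight_percent(raw_gib, epochs, total_gib=PILE_EFFECTIVE_GIB):
    """Share of the Pile (in percent) occupied by a component."""
    return 100.0 * effective_size(raw_gib, epochs) / total_gib


for name, (raw, epochs) in COMPONENTS.items():
    size = effective_size(raw, epochs)
    weight = weight_percent(raw, epochs)
    print(f"{name}: {size:.2f} GiB effective, {weight:.2f}% of the Pile")
```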
Common Crawl is a collection of website crawls from 2008 onwards, including raw web pages, metadata and text extractions. Due to the raw nature of the dataset, Common Crawl has the advantage of including text from diverse domains, but at the cost of varying quality data. Due to this, use of Common Crawl typically necessitates well-designed extraction and filtering. Our Common Crawl-based dataset, Pile-CC, uses jusText (Endrédy and Novák, 2013) on Web Archive files (raw HTTP responses including page HTML) for extraction, which yields higher quality output than directly using the WET files (extracted plain text).
PubMed Central (PMC) is a subset of the PubMed online repository for biomedical articles run by the United States of America’s National Center for Biotechnology Information (NCBI), providing open, full-text access to nearly five million publications. Most publications indexed by PMC are recent, and their inclusion is mandated for all NIH funded research starting from 2008 by the NIH Public Access Policy. We included PMC in the hopes that it will benefit potential downstream applications to the medical domain.
Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020). Bibliotik consists of a mix of fiction and nonfiction books and is almost an order of magnitude larger than our next largest book dataset (BookCorpus2). We included Bibliotik because books are invaluable for long-range context modeling research and coherent storytelling.
OpenWebText2 (OWT2) is a generalized web scrape dataset inspired by WebText (Radford et al., 2019) and OpenWebTextCorpus (Gokaslan and Cohen, 2019). Similar to the original WebText, we use net upvotes on Reddit submissions as a proxy for outgoing link quality. OpenWebText2 includes more recent content from Reddit submissions up until 2020, content from multiple languages, document metadata, multiple dataset versions, and open source replication code. We included OWT2 as a high quality general purpose dataset.
ArXiv is a preprint server for research papers that has operated since 1991. As shown in fig. 10, arXiv papers are predominantly in the fields of Math, Computer Science, and Physics. We included arXiv in the hopes that it will be a source of high quality text and math knowledge, and benefit potential downstream applications to research in these areas. ArXiv papers are written in LaTeX, a common typesetting language for mathematics, computer science, physics, and some adjacent fields. Training a language model to be able to generate papers written in LaTeX could be a huge boon to the research community.
GitHub is a large corpus of open-source code repositories. Motivated by the ability of GPT-3 (Brown et al., 2020) to generate plausible code completions despite its training data not containing any explicitly gathered code datasets, we included GitHub in the hopes that it would enable better downstream performance on code-related tasks.
The Free Law Project is a US-registered non-profit that provides access to and analytical tools for academic studies in the legal realm. CourtListener,3 part of the Free Law Project, provides bulk downloads for millions of legal opinions from federal and state courts. While the full dataset provides multiple modalities of legal proceedings, including dockets, bibliographic information on judges, and other metadata, we focused specifically on court opinions due to an abundance of full-text entries. This data is entirely within the public domain.
The Stack Exchange Data Dump4 contains an anonymized set of all user-contributed content on the Stack Exchange network, a popular collection of websites centered around user-contributed questions and answers. It is one of the largest publicly available repositories of question-answer pairs, and covers a wide range of subjects—from programming, to gardening, to Buddhism. We included Stack Exchange in the hopes that it will improve the question answering capabilities of downstream models on diverse domains.
USPTO Backgrounds is a dataset of background sections from patents granted by the United States Patent and Trademark Office, derived from its published bulk archives5. A typical patent background lays out the general context of the invention, gives an overview of the technical field, and sets up the framing of the problem space. We included USPTO Backgrounds because it contains a large volume of technical writing on applied subjects, aimed at a non-technical audience.
Wikipedia is a standard source of high-quality text for language modeling. In addition to being a source of high quality, clean English text, it is also valuable as it is written in expository prose, and spans many domains.
PubMed Abstracts consists of the abstracts from 30 million publications in PubMed, the online repository for biomedical articles run by the National Library of Medicine. While the PMC (see Section 2.2) provides full-text access, the subset of coverage is significantly limited and biased towards recent publications. PubMed also incorporates MEDLINE, which expands the coverage of biomedical abstracts from 1946 to present day.
3 https://www.courtlistener.com/
4 https://archive.org/details/stackexchange
5 https://bulkdata.uspto.gov/
Project Gutenberg is a dataset of classic Western literature. The specific Project Gutenberg derived dataset we used, PG-19, consists of Project Gutenberg books from before 1919 (Rae et al., 2019), which represent distinct styles from the more modern Books3 and BookCorpus. Additionally, the PG-19 dataset is already being used for long-distance context modeling.
The OpenSubtitles dataset is an English language dataset of subtitles from movies and television shows gathered by Tiedemann (2016). Subtitles provide an important source of natural dialog, as well as an understanding of fictional formats other than prose, which may prove useful for creative writing generation tasks such as screenwriting, speechwriting, and interactive storytelling.
The DeepMind Mathematics dataset consists of a collection of mathematical problems from topics such as algebra, arithmetic, calculus, number theory, and probability, formatted as natural language prompts (Saxton et al., 2019). One major weakness of large language models has been performance on mathematical tasks (Brown et al., 2020), which may be due in part to a lack of math problems in the training set. By explicitly including a dataset of mathematical problems, we hope to improve the mathematical ability of language models trained on the Pile.
BookCorpus2 is an expanded version of the original BookCorpus (Zhu et al., 2015), a widely used language modeling corpus consisting of books written by “as of yet unpublished authors.” BookCorpus is therefore unlikely to have significant overlap with Project Gutenberg and Books3, which consist of published books. BookCorpus is also commonly used as a dataset for training language models (Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019).
The Ubuntu IRC dataset is derived from the publicly available chatlogs6 of all Ubuntu-related channels on the Freenode IRC chat server. Chatlog data provides an opportunity to model real-time human interactions, which feature a level of spontaneity not typically found in other modes of social media.
6 https://irclogs.ubuntu.com/
EuroParl (Koehn, 2005) is a multilingual parallel corpus originally introduced for machine translation but which has also seen use in several other fields of NLP (Groves and Way, 2006; Van Halteren, 2008; Ciobanu et al., 2017). We use the most current version at time of writing, which consists of the proceedings of the European Parliament in 21 European languages from 1996 until 2012.
The YouTube Subtitles dataset is a parallel corpus of text gathered from human generated closed captions on YouTube. In addition to providing multilingual data, YouTube Subtitles is also a source of educational content, popular culture, and natural dialog.
2.19 PhilPapers
The PhilPapers7 dataset consists of open-access philosophy publications from an international database maintained by the Center for Digital Philosophy at the University of Western Ontario. We included PhilPapers because it spans a wide body of abstract, conceptual discourse, and its articles contain high quality academic writing.
ExPORTER
The NIH provides a bulk-data repository of grant abstracts for awarded applications through the ExPORTER8 service, covering fiscal years 1985-present. We included the dataset because it contains examples of high-quality scientific writing.
Hacker News9 is a link aggregator operated by Y Combinator, a startup incubator and investment fund. Users submit articles defined as “anything that gratifies one’s intellectual curiosity,” but submitted articles tend to focus on topics in computer science and entrepreneurship. Users can comment on submitted stories, resulting in comment trees discussing and critiquing submitted stories. We scrape, parse, and include these comment trees since we believe they provide high quality dialogue and debate on niche topics.
7 https://philpapers.org/
8 https://exporter.nih.gov/
9 https://news.ycombinator.com
The Enron Emails dataset (Klimt and Yang, 2004) is a valuable corpus commonly used for research about the usage patterns of email. We included Enron Emails to aid in understanding the modality of email communications, which is typically not found in any of our other datasets.
While the Pile was conceived as a training dataset for large-scale language models, its coverage of multiple disparate domains makes it also suitable as an evaluation dataset. In this section, we describe how the Pile can be used as a broad-coverage dataset for benchmarking language models.
The Pile is provided as train, validation, and testing splits. The validation and testing components each contain 0.1% of the data, sampled uniformly at random. While this is a far smaller percentage than most datasets, the sheer size of the dataset results in over 1 GiB of validation and testing data each. We highlight that while we have made efforts to deduplicate documents within the Pile (see Section D.2), it is still possible that some documents are duplicated across the train/validation/test splits.
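One way to produce splits of this shape is sketched below. This is only an illustration under stated assumptions: the text says the held-out sets are sampled uniformly at random but does not specify the mechanism, so the independent per-document draw, the fixed seed, and the function names here are hypothetical rather than the procedure actually used for the Pile.

```python
import random


def assign_split(rng, val_frac=0.001, test_frac=0.001):
    """Draw one split label; 0.1% validation and 0.1% test by default."""
    r = rng.random()
    if r < val_frac:
        return "validation"
    if r < val_frac + test_frac:
        return "test"
    return "train"


def split_documents(docs, seed=0):
    """Assign every document to exactly one split, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the split can be regenerated
    splits = {"train": [], "validation": [], "test": []}
    for doc in docs:
        splits[assign_split(rng)].append(doc)
    return splits


splits = split_documents(range(100_000))
print({name: len(items) for name, items in splits.items()})
```

Because each document lands in exactly one split, this sketch guarantees that a single document is never shared between splits, but it does nothing about near-duplicate documents, which is the residual risk the paragraph above describes.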
Our preferred metric is bits per UTF-8 encoded byte (BPB). Bits per byte is preferred over bits per character or perplexity when using Pile as a metric due to its invariance to different tokenization schemes and the ambiguity of measuring characters in Unicode. To compute bits per byte from a given negative log likelihood loss $\ell$, we compute $\mathrm{BPB} = (L_T / L_B) \log_2(e^{\ell}) = (L_T / L_B)\, \ell / \ln(2)$, where $L_T$ is the length of the dataset in tokens and $L_B$ is the length of the dataset in UTF-8 encoded bytes. We find that $L_T / L_B$ is 0.29335 GPT-2 tokens/byte across the Pile; dataset-specific values of $L_T / L_B$ can be found in Table 7.
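The conversion from a per-token loss to BPB can be written in a few lines. The sketch below assumes the loss is a mean negative log likelihood in nats per token, which is what the division by ln(2) in the formula implies; the function names are illustrative. The second helper is an algebraically equivalent form that starts from a summed loss and a byte count.

```python
import math

# L_T / L_B for GPT-2 tokens across the whole Pile, as reported in the text.
TOKENS_PER_BYTE_PILE = 0.29335


def bits_per_byte(nll_nats_per_token, tokens_per_byte=TOKENS_PER_BYTE_PILE):
    """BPB = (L_T / L_B) * loss / ln(2), with the loss in nats per token."""
    return tokens_per_byte * nll_nats_per_token / math.log(2)


def bits_per_byte_from_totals(total_nll_nats, n_bytes):
    """Same quantity from the summed loss over all tokens and the byte count."""
    return total_nll_nats / (n_bytes * math.log(2))


# A loss of ln(2) nats per token is exactly one bit per token, so with one
# token per byte the result is one bit per byte.
print(bits_per_byte(math.log(2), tokens_per_byte=1.0))
```

Since the token count cancels out of the second form, two models with different tokenizers can be compared directly as long as both are scored on the same bytes.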
We compute the test perplexity of the constituent datasets of the Pile using GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020), shown in Figure 2. We use all available versions of GPT-2, and all four versions of GPT-3 available via the OpenAI API. Because of the cost associated with using the OpenAI API, we evaluate on one-tenth of the respective test sets for most of the constituent datasets. We report the perplexity converted to bits per UTF-8 encoded byte (BPB). Importantly, we compute perplexity by evaluating each document independently within each dataset, as opposed to concatenating all documents as is common practice for computing perplexity on large corpora.
Full details of the perplexity computation can be found in Appendix E.2.
Unsurprisingly, larger language models generally attain lower perplexity compared to smaller models. Recent work has shown an increased focus on the empirical scaling laws of language models (Kaplan et al., 2020; Henighan et al., 2020). As such, we investigate the scaling law for the GPT-2 and GPT-3 families of models on perplexity evaluation on the Pile. The scaling law relation for the GPT-3 family of models is shown in Figure 2 (see footnote 10). The line of best fit shown in the figure has a coefficient of -0.1674 and an intercept of 2.5516.
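To make the reported fit concrete, the sketch below evaluates the line at a few model sizes. It rests on an assumption that the text does not state: that the fitted line relates BPB to the base-10 logarithm of the parameter count. If the figure uses a different axis transformation, the numbers printed here do not apply. The function name is illustrative.

```python
import math

SLOPE = -0.1674      # coefficient of the line of best fit, from the text
INTERCEPT = 2.5516   # intercept of the line of best fit, from the text


def predicted_bpb(n_params):
    """Evaluate the fitted line.

    ASSUMPTION: x is log10(number of parameters) and y is Pile BPB. The
    text gives only the coefficient and intercept, not the axis scaling.
    """
    return SLOPE * math.log10(n_params) + INTERCEPT


for n_params in (1e9, 1e10, 1e11):
    print(f"{n_params:.0e} parameters -> {predicted_bpb(n_params):.4f} BPB (under the assumption above)")
```

Under this reading, each tenfold increase in parameter count lowers the predicted BPB by 0.1674.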
Figure 2: Scaling law for performance of GPT-2/3 models. ‘Zero-shot’ refers to the fact that none of the models have been fine-tuned on data from the Pile.
Interestingly, while GPT-2 and GPT-3 were not trained on the Pile, there still appears to be a clear scaling law without diminishing returns. We hypothesize that this is due to the inherent generalization capability of these models. We leave a more rigorous analysis of zero-shot scaling laws to future work.
10 While the sizes of GPT-3 models on the OpenAI API have not been publicized, we assume here that ada, babbage, curie and davinci models correspond to 2.7B, 6.7B, 13B and 175B parameter models respectively.
Performance
Determining which components GPT-3 underperforms on provides information about which Pile components are most dissimilar to the distribution of text (web pages and books) that GPT-3 was trained on. These components would thus make especially good candidates for supplementing GPT-3 training data. These results are also valuable for determining which types of datasets to emphasize for future iterations of the Pile.
Due to the difference in entropy of different datasets, directly comparing perplexity of GPT-3 on different Pile components is not an accurate indication of relative performance. Ideally we would train a GPT-3 model from scratch on the Pile and compare the difference in loss per dataset with that of the original GPT-3. Because of resource constraints, we instead use a GPT-2 model trained from scratch on the Pile (see Section 4) to construct a proxy measure. To construct our proxy, we first measure the improvement from the GPT-2-Pile model to GPT-3 on each component. Then, we normalize our results by setting the change on OpenWebText2 to be zero. This computation is shown in the equation below:
$$\Delta_{\text{set}} = \left(L^{\text{GPT3}}_{\text{set}} - L^{\text{GPT3}}_{\text{owt2}}\right) - \left(L^{\text{GPT2Pile}}_{\text{set}} - L^{\text{GPT2Pile}}_{\text{owt2}}\right)$$
where $L^{M}_{d}$ denotes the loss of model $M$ on dataset $d$.
Since GPT2-Pile was trained on both OWT2 and the dataset we are evaluating, we expect the second term in ∆set to reflect the difference in the intrinsic difficulty of the two datasets. Thus the total value of ∆set reflects how much harder the dataset we are evaluating was for GPT-3 than OWT2, minus the relative difficulty of the two tasks. As GPT-3 was trained on data very similar to OWT2, this gives us a proxy for how much better GPT-3 would do if it were trained on the Pile.
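The proxy is a difference of differences: how much harder a component is than OWT2 for GPT-3, minus how much harder it is than OWT2 for the GPT-2 model trained on the Pile. The following is a minimal sketch of that arithmetic; the function name and the loss values in the usage lines are hypothetical placeholders chosen for illustration, not results from the paper.

```python
def delta_set(gpt3_set, gpt3_owt2, gpt2pile_set, gpt2pile_owt2):
    """Relative GPT-3 underperformance on a component, normalized to OWT2.

    First term: how much harder the component is than OWT2 for GPT-3.
    Second term: how much harder it is than OWT2 for GPT-2 trained on the
    Pile, taken as the intrinsic difficulty gap between the two datasets.
    """
    return (gpt3_set - gpt3_owt2) - (gpt2pile_set - gpt2pile_owt2)


# HYPOTHETICAL losses in BPB, for illustration only.
print(delta_set(gpt3_set=1.00, gpt3_owt2=0.80, gpt2pile_set=0.90, gpt2pile_owt2=0.90))

# By construction the metric is zero on OWT2 itself.
print(delta_set(0.80, 0.80, 0.90, 0.90))
```

A positive value means GPT-3 does worse on the component than the intrinsic difficulty gap alone would explain, which is why lower values indicate better relative performance by GPT-3.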
The results are shown in Figure 3. As a sanity check, we observe that datasets that are contained in, or are extremely similar to, GPT-3’s training set (Books3, Wikipedia (en), Pile-CC and Project Gutenberg) score close to zero on our metric.
GPT-3 appears to perform poorly on datasets pertaining to research or academic writing like PubMed Central, PubMed Abstracts, and ArXiv; domain-specific datasets like FreeLaw, HackerNews, and USPTO Backgrounds; and on datasets containing predominantly text distinct from natural language, like GitHub and DM Mathematics. In addition, the majority of datasets see less of an improvement than OpenWebText2. As such, we expect a GPT-3 sized model trained on Pile to perform significantly better on research related tasks, software tasks, and symbol manipulation tasks than the base model. Additionally, this experiment provides evidence that the majority of Pile components are not redundant with the predominantly web-based GPT-3 training data.
We note that this metric is only a proxy for similarity, and that it could be confounded by dataset specific scaling effects. Although our results largely accord with expectations, there are some puzzling results, like the datasets on which GPT-3 outperformed GPT-2 Pile. We hypothesize that GPT-3 learns to be so good at these datasets that training on them explicitly does not notably benefit the model’s performance. We leave a more rigorous analysis of these effects for future work.
To confirm the effectiveness of the Pile for improving language modeling quality, we train architecturally-identical 1.3 billion parameter models based on those in Brown et al. (2020) on different datasets and evaluate on the WikiText and LAMBADA tasks as benchmarks of language modeling ability. We also report results on the Pile as a measure of more cross-domain generalization.
To ensure a fair comparison across datasets of different sizes, we decontaminate any instances of the evaluation sets using the same 13-gram overlap filtering as in Brown et al. (2020) and downsample to 40GB to control for dataset size. As we control for dataset size, we emphasize that our evaluation is generous to CC-100 (en), which is about 1/3 the size of the Pile in reality.
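The idea behind 13-gram overlap filtering can be sketched as follows. This is a simplified illustration, not the procedure of Brown et al. (2020) or the code used for these experiments: the whitespace tokenization, the lowercasing, and the choice to drop an entire training document on any shared 13-gram are all assumptions made to keep the example short, and the function names are hypothetical.

```python
def ngrams(text, n=13):
    """Set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_docs, n=13):
    """Collect every n-gram that appears anywhere in the evaluation sets."""
    index = set()
    for doc in eval_docs:
        index |= ngrams(doc, n)
    return index


def decontaminate(train_docs, eval_index, n=13):
    """Keep only training documents sharing no n-gram with the eval sets.

    ASSUMPTION: whole documents are dropped on any overlap. Documents
    shorter than n tokens have no n-grams and are always kept.
    """
    return [doc for doc in train_docs if ngrams(doc, n).isdisjoint(eval_index)]


eval_doc = " ".join(f"w{i}" for i in range(15))
leaky = "prefix " + " ".join(f"w{i}" for i in range(13)) + " suffix"
clean = " ".join(f"x{i}" for i in range(20))
index = build_eval_index([eval_doc])
print(len(decontaminate([leaky, clean], index)))
```

A set of tuples is enough at this scale; for corpora the size of the Pile, a production implementation would more likely hash the n-grams or use a probabilistic structure to bound memory.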
We compare the following datasets: the Pile, the English component of the CC-100 dataset11 (Wenzek et al., 2019; Conneau et al., 2020), and a sample of raw CC WET files filtered for English-only.
Table 2: Test perplexity of the Pile using GPT-2 and GPT-3, converted to bits per UTF-8 encoded byte (BPB). Evaluation is performed on one-tenth of the test data of the Pile, on a per-document basis. Bold indicates the best-performing model in each row.
On traditional language modeling benchmarks, the Pile improves significantly on WikiText and shows negligible changes in LAMBADA. However, models trained on Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, as shown in Table 4. This indicates that models trained on the Pile have greater cross-domain generalization capabilities without compromising performance on traditional benchmarks.
The magnitude of improvement over CC-100 per set is shown in Figure 4. Unsurprisingly, there is almost no improvement on Pile-CC. However, the model trained on the Pile performs significantly better than either of the other models on academic datasets such as ArXiv, PubMed Central, FreeLaw, and PhilPapers. It also improves significantly on programming-related datasets like Github and StackExchange, on EuroParl, due to the lack of multilingual text in either other dataset, and on DM Mathematics, indicating a significant improvement in mathematical ability.
11 The data was obtained from http://data.statmt.org/cc-100/.
Surprisingly, raw Common Crawl performs better on the Pile BPB than CC-100, despite losing by a significant margin on LAMBADA and WikiText. We hypothesize that this is due to the perplexity-based filtering used in CC-100, where a language model is trained on Wikipedia and all data with a perplexity too high or too low is discarded. This effectively discards any data too similar to or too different from Wikipedia, which severely limits the diversity of the collected data. This result suggests that future work using Common Crawl should take caution with filtering to preserve its diversity.
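The filtering mechanism described above amounts to keeping only documents whose perplexity falls inside a band. The sketch below illustrates that idea only; the thresholds, the scoring function, and the example scores are hypothetical stand-ins, and it does not reproduce the actual CC-100 pipeline.

```python
def perplexity_band_filter(docs, perplexity_fn, low, high):
    """Keep documents whose perplexity under a reference model lies in [low, high].

    Documents below `low` are very similar to the reference corpus and
    documents above `high` are very different from it; both are discarded.
    """
    return [doc for doc in docs if low <= perplexity_fn(doc) <= high]


# HYPOTHETICAL perplexities under a Wikipedia-trained model, for illustration.
scores = {
    "encyclopedia-style article": 12.0,
    "news story": 85.0,
    "source code file": 950.0,
}
kept = perplexity_band_filter(list(scores), scores.get, low=30.0, high=300.0)
print(kept)
```

In this toy example both the text closest to the reference corpus and the text furthest from it are removed, which is the loss of diversity the paragraph above hypothesizes.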
In this section, we cover the Structural Statistics of the dataset, which provide more coarse-grained and statistical information about the Pile. In Section 6, we provide a closer investigation and documentation of the textual content within the Pile datasets.
Figure 3: Change in BPB from GPT-2 trained on Pile to GPT-3 zero-shot, relative to OpenWebText2 BPB change. Dotted line indicates overall Pile change. Lower indicates better relative performance by GPT-3.
Table 3: Size-controlled evaluation results. Each dataset is deduplicated against all evaluation metrics and subsampled to approximately 40GB to control for the effects of dataset size. For LAMBADA, we use the variant of the data introduced in Radford et al. (2019) and only evaluate the perplexity on the final token rather than the final word. For WikiText, we report the perplexity per GPT-2 token. † indicates that the size is an estimate.
Each dataset consists of a large number of documents. We analyze the distribution of document lengths, as well as the number of bytes-per-token using the GPT-2 tokenizer in order to put our ablations in context.
While the majority of documents in the Pile are short, there is a long tail of very long documents (Figure 5).
Since the GPT-2 BPE tokenizer is trained on WebText, the mean bytes per token is also a very rough indicator of how syntactically different each Pile component is from WebText. For instance, datasets like NIH ExPorter, OpenWebText2 and Books3 consist largely of ordinary text in a similar distribution to WebText, which is reflected in a greater number of bytes per token. On the other hand, many of the sets with the lowest bytes per token are those which consist in large part of non-text content (Github, ArXiv, Stack Exchange, and DM Mathematics) or languages other than English (EuroParl).
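Bytes per token is simply the UTF-8 byte count divided by the token count. The sketch below computes it at the corpus level with a pluggable tokenizer. Two things here are assumptions: the whitespace tokenizer is only a stand-in so that the example runs without external dependencies (the paper uses the GPT-2 BPE tokenizer), and the statistic in the paper may be averaged per document rather than over the whole corpus as done here.

```python
def mean_bytes_per_token(docs, tokenize):
    """Total UTF-8 bytes divided by total tokens over a collection of documents."""
    total_bytes = sum(len(doc.encode("utf-8")) for doc in docs)
    total_tokens = sum(len(tokenize(doc)) for doc in docs)
    return total_bytes / total_tokens


# Stand-in tokenizer (ASSUMPTION): whitespace splitting, not GPT-2 BPE.
docs = ["ab cd", "efgh"]
print(mean_bytes_per_token(docs, str.split))

# The Pile-wide ratio of 0.29335 GPT-2 tokens per byte reported earlier
# corresponds to roughly 3.41 bytes per token.
print(1 / 0.29335)
```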
While only 13% of the world’s population speaks English, the vast majority of NLP research is done on English. For the Pile, we took a similar approach to the dataset used by Brown et al. (2020) and focused predominantly on English, while also not explicitly filtering out other languages when collecting our own data. When evaluating a multilingual dataset, our main criterion for inclusion was whether the English component of the dataset merited inclusion alone.
Figure 4: Magnitude of BPB improvement of Pile model over CC-100 model on each test set.
Figure 5: Distribution of document lengths in Pile. The highest 1 percentile of document length are considered to be outliers and excluded from this plot.
Figure 6: Mean bytes per GPT-2-token for each dataset in the Pile. Error bars indicate standard deviation.
Using fasttext (Suárez et al., 2019a), we determine that the Pile is 97.4% English. We note that due to issues with language identification, particularly with rare languages (Caswell et al., 2020), this methodology provides only a rough estimate for English content and no reliable conclusions for low-resource languages can be drawn.
As the scale of machine learning research has grown, scrutiny has been placed on the ever larger datasets that models are trained on (Prabhu and Birhane, 2020; Biderman and Scheirer, 2020). While this issue has been raised within AI ethics and bias research (Hovy and Spruit, 2016; Hutchinson et al., 2020; Blodgett et al., 2020), it has not been a focal point of concern within the language modeling community. Despite the proliferation of work exploring and documenting issues with datasets (Gebru et al., 2018; Bender and Friedman, 2018; Jo and Gebru, 2020), no dataset intended to train massive language models has been seriously documented by its creators12. Therefore, our analyses serve two goals: to address ethical concerns about the Pile, and to promote and normalize the practice of engaging with the AI ethics literature.
Natural language processing technologies are widely applicable and can be used in extremely different contexts. What is and is not appropriate data to train on can therefore vary wildly with the application context. In our view, the best approach is to document rather than eliminate potentially concerning aspects of datasets13, particularly since the purpose of the Pile is to train general-purpose language models. The primary goal of our documentation, therefore, is to empower NLP researchers to make informed decisions.
To document the Pile, we chose to implement two frameworks that have been proposed by methodologists and ethics researchers. The first, the datasheets methodology (Gebru et al., 2018), is a general purpose methodology that is recommended by several methodologists (Raji and Yang, 2019; Biderman and Scheirer, 2020) and appears to be used more frequently by practitioners than alternatives. The second, the data statements methodology (Bender and Friedman, 2018), was proposed specifically for natural language processing and has been well received by the NLP community. Our datasheet and data statement will be featured in the GitHub repository where the code for the Pile is stored and will also be available as separate documents on arXiv (Biderman et al., 2021; Biderman, 2021).
In addition to the datasheet and data statement, there is additional information that may be helpful to people training language models that these documents do not cover. In the rest of this section we investigate and document in greater detail some of this additional contextual information.
In order to better understand the specific subject matter covered by the Pile, we performed a topic modeling analysis on its components. Using Gensim (Rehurek et al., 2011), we trained 16-topic Latent Dirichlet Allocation (Blei et al., 2003) models on each component of the validation set of the Pile concurrently, in an online fashion (Hoffman et al., 2010). We filtered the Pile for English only for this analysis. Afterwards, we computed the perplexity of the Common Crawl-derived (Pile-CC) topic model on the document sets of the other components. In this way, we provide a rough measure of the degree to which parts of the Pile contain topics not well covered within Common Crawl.
In Figure 7, these cross-component perplexities are shown, with a vertical line indicating the perplexity of the Pile-CC topic model evaluated on the documents of OpenWebText2. This component was chosen as a baseline of comparison for similar reasons as in the previous evaluation: it is derived in a similar manner (filtered crawls of the open web) as the Common Crawl, and thus is expected to contain a similar distribution of topics. Although Pile-CC is somewhat diverse in its content, several of the Pile’s other components deviate from it strongly in their topical focus, as evidenced by higher perplexity on Github, PhilPapers, and EuroParl.
We also documented the topical clusters inferred from our LDA models for each component, which we provide in Appendix C. As expected, though the larger CC-derived component itself represents a diversity of content (including politics, education, sports and entertainment), the content clusters it misses become apparent when compared qualitatively to other components of the Pile. Notably, the data modes covering programming, logic, physics, and legal knowledge appear largely absent.
12 Brown et al. (2020) discuss ethical issues surrounding their model, but do not discuss those surrounding the training dataset itself.
13 That said, we did exclude several datasets; see Appendix B for details.
Due to the wide diversity in origins, it is possible for the Pile to contain pejorative, sexually explicit, or otherwise objectionable content. As this content may not be desirable for some use cases, we break down profanity on a per-dataset level.
We used the profanity-checker Python package (Zhou, 2019). This package includes a “toxicity model” trained on multiple profanity lists as well as the Wikidetox Toxic Comment Dataset (Wulczyn et al., 2016) and classifies a given string as being profane or not profane.
We considered only the English sentences in each dataset using the same language classifier from Section 3.7. We did this since profanity-checker is built for English and other languages may improperly impact the results. For instance, the German nominative/accusative feminine/plural definite article “die” is flagged as being profane regardless of context. We split each sentence into words and computed the percentage of words that are flagged as profane for each component of the Pile. We emphasize that this methodology is only a proxy for profanity, given the complexity of determining whether a given word or phrase is profane in context.
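The word-level measurement described above can be illustrated with a minimal sketch. It is not the profanity-checker model: the classifier is passed in as a callable, and `toy_is_profane`, `TOY_FLAGGED`, and the sample components are hypothetical stand-ins chosen only so the example runs without external dependencies.

```python
from typing import Callable, Iterable

# Hypothetical stand-in word list; NOT the model used by profanity-checker.
TOY_FLAGGED = {"darn", "heck"}


def toy_is_profane(word: str) -> bool:
    """Toy classifier: flag a word if it is in the stand-in list."""
    return word.lower().strip(".,!?;:\"'") in TOY_FLAGGED


def percent_profane_words(
    sentences: Iterable[str],
    is_profane: Callable[[str], bool] = toy_is_profane,
) -> float:
    """Split each sentence on whitespace and return the percentage of flagged words."""
    total = 0
    flagged = 0
    for sentence in sentences:
        for word in sentence.split():
            total += 1
            flagged += is_profane(word)
    return 100.0 * flagged / total if total else 0.0


# Hypothetical English-only sentences, grouped per component.
components = {
    "component_a": ["Well heck, that failed.", "It works now."],
    "component_b": ["The results are shown below."],
}
percentages = {name: percent_profane_words(s) for name, s in components.items()}
```

Swapping `toy_is_profane` for a real classifier leaves the per-component aggregation unchanged, which is the part this sketch is meant to show.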
As shown in Figure 8, the Pile as a whole appears less profane than Pile-CC. Further, the majority of Pile components appear less profane than Pile-CC as well.
We also broke each dataset down on a sentence level, to allow profanity-checker to check entire sentences. Splitting datasets by sentence allows for additional context to be considered when determining whether content is pejorative. Our results are shown in Figure 12.
As language models may pick up unexpected biases from the training data, we performed a preliminary analysis of the different components that make up the Pile. Because models with different characteristics may be trained on the Pile, we aimed to document the biases of the data and not a specific model. We primarily focus on co-occurrence tests, where we analyzed what words occur in the same sentence as other specific words. Using this information, we can estimate what words strongly bias towards a category word, as well as calculate the general sentiment of surrounding words.
We focused our analysis on gender, religion, and race. Our goal is to provide users of this dataset with preliminary guidance on how the different components are biased so that they can make decisions on which components to train on.
All tables and figures in this section can be found in the Appendix.
We computed gender associations by computing co-occurrences for binary pronouns. For each word, we computed the difference in the rate it co-occurs with “he” and “she”¹⁴ and weighed it by the square root of its frequency. We report the top 15 most biased adjectives or adverbs (Loper and Bird, 2002) for each in Table 10. We see that words like “military”, “criminal”, and “offensive” strongly bias towards men, while “little”, “married”, “sexual”, and “happy” bias towards women.
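A minimal sketch of one way to compute such a score follows. The exact normalization is not spelled out in the text, so this is an assumption: the rate is taken as the fraction of a word's sentences that also contain the pronoun, and the frequency as the number of sentences containing the word. The function name and sample sentences are hypothetical, and the part-of-speech filtering to adjectives and adverbs is omitted.

```python
import math
import re
from collections import Counter


def gender_bias_scores(sentences):
    """Return {word: score}; positive leans towards "he", negative towards "she".

    Assumed scoring: (rate with "he" - rate with "she") * sqrt(frequency),
    where frequency is the number of sentences containing the word.
    """
    he_counts, she_counts, freq = Counter(), Counter(), Counter()
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        has_he = "he" in tokens
        has_she = "she" in tokens
        for word in tokens - {"he", "she"}:
            freq[word] += 1
            he_counts[word] += has_he
            she_counts[word] += has_she
    return {
        word: (he_counts[word] - she_counts[word]) / freq[word] * math.sqrt(freq[word])
        for word in freq
    }


# Hypothetical toy corpus.
corpus = [
    "He was a military officer.",
    "He joined the military.",
    "She was happy.",
    "She seemed happy and he agreed.",
]
scores = gender_bias_scores(corpus)
most_he = max(scores, key=scores.get)
```

The square-root weighting keeps very rare words, whose rates are noisy, from dominating the ranking while still letting the rate difference matter for frequent words.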
In addition, we computed the average sentiment (Baccianella et al., 2010) of words co-occurring with the gendered pronouns across each dataset in Figure 13. Generally, we find no significant sentiment bias towards men or women. This, of course, does not mean that the dataset is free of gender bias (as our co-occurrence tests show).
We computed a similar co-occurrence analysis for religion, which can be found in Table 11. Like gender, we find that these co-occurrences reflect how these terms are used in pockets of online discourse. For example, “radical” co-occurs with “muslim” at a high rate, while “rational” often co-occurs with “atheist”. This analysis also demonstrates some of the limitations of a purely co-occurrence based analysis. For example, “religious” often co-occurs with “atheist”, which likely reflects the type of conversations in which the word “atheist” is likely to occur as opposed to a descriptor of “atheist”.
14 We chose to only study male and female pronouns as a simplifying assumption. Studying “they” would require us to isolate its usage as a singular noun.
Figure 7: Log perplexity of 16-topic LDA trained on Pile-CC, on other Pile components. Dotted line indicates log perplexity of the topic model on OpenWebText2. Higher indicates a larger topical divergence from Pile-CC.
In addition, we computed the average sentiment of co-occurrences across each of the constituent datasets in Figure 14. Over the entire dataset, we find that “Buddhist” has the highest sentiment, followed by “Hindu”, “Christian”, “Atheist”, and “Muslim”. Notably, “Jew” is the lowest, perhaps reflecting its historical use as a pejorative.
Finally, we ran the same analysis for racial groups. Here, as identifiers like “black” or “white” often do not indicate race, we instead compute co-occurrences with phrases like “black man” or “white woman”.
We show the top 15 most biased words for each demographic in Table 12. Once again, we found that the co-occurrences reflect the context in which these terms are used. For example, the 4 most biased words for “black” are “unarmed”, “civil”, “criminal”, and “scary”.
Similar to above, we compute the average sentiment of co-occurring words. We report the average sentiment numbers in Table 13. We find that “hispanic/latino” narrowly edges out “asian” for the highest sentiment, followed by “white”. On the other hand, “black” had the lowest sentiment, at -0.15.
We note that for all demographics, the average sentiment is negative. We hypothesize that this is due to the specific context in which the phrases we use to compute co-occurrences appear. For example, it is often quite common for news articles to describe suspects as an “asian man”.
Figure 8: Percentage of words classified as profane in the Pile. The percentage of the CC component and the weighted mean of the Pile as a whole are shown as horizontal lines.
Another issue with the use of texts in natural language processing research is consent. Although one is typically not legally obligated to receive the permission of an author to train an NLP algorithm on their work¹⁵, many consider doing so a moral obligation or a good measure to guard against misuse (Obar, 2020; Prabhu and Birhane, 2020). On the other hand, there is significant disagreement surrounding the ethics of repurposing data protected by terms of service in research contexts (Vitak et al., 2016; Fiesler et al., 2020), particularly given the power asymmetries inherent in digital platforms, which often close off independent researchers from investigating public data while simultaneously compelling users to consent to its private use (Halavais, 2019).
15 Laws vary by country. For a discussion of US law, see Section 7.1
While much of the Pile’s data comes from sources that have expressly consented to its wider dissemination and use in research, researchers often fail to clearly document where their data came from and under what terms its use was consented to. In light of this, we felt it appropriate to release the Pile with transparency around how the authors of its data have indicated that that data can be used.
To provide needed nuance to our discussion of consent, we identified three tiers of availability for public use. Public data is data which is freely and readily available on the internet. This primarily excludes data which is pay-walled (regardless of how easy that paywall is to bypass) and data which cannot be easily obtained but can be obtained, e.g. through a torrent or on the dark web. Terms of Service (ToS) compliant data is data which is obtained and used in a fashion that is known to be consistent with the terms of service of the data host. Data with authorial consent is data for which the original authors of the work consented to the use of their data, or where a reasonable person could not assume that their data would not be used for purposes such as research. ToS compliant data and authorial consented data differ in two main ways: It is important to keep in mind that people typically do not read Terms of Service, and additionally that being ToS-compliant does not entail authorial consent. We adopted a strict model of consent, where ambiguous or unknown consent is treated as non-consensual.
Table 5 summarizes our understanding of the status of each of the datasets within the Pile. Datasets marked with a ✓ are compliant in the relevant respects, though a couple of datasets are worth remarking on in particular. Books3 and OpenSubtitles are being used in a fashion that is consistent with the terms of service of the data host. However, this is somewhat misleading in that the data host is not authorized to post the data online by the parties that own it. The Enron Emails dataset was not collected with the permission of the authors, but was collected by the U.S. government as part of a criminal investigation. While the people whose emails are in the Enron dataset are aware of this fact, they were not given the ability to consent to its inclusion in any way.
There are five datasets included in the Pile that were not collected and distributed in a ToS compliant fashion and for which the authors had no ability to consent to their data being used. Each of these datasets is widely used, both in the NLP literature and the world at large. With the exception of the YouTube Subtitles dataset, each of these datasets was published by researchers and is passed around freely on the internet. The YouTube Subtitles dataset was created by us for this project, using a very popular unofficial API that is both widely used and easily obtainable on Pip, Conda, and GitHub, among other places. Given the processing applied and the difficulty of identifying particular files in the Pile, we feel that our use of these datasets does not constitute significantly increased harm beyond that which has already been done by the widespread publication of these datasets.
The Pile represents yet another stepping stone along the path of scaling models and datasets to ever larger sizes and capabilities. There are many serious concerns about how the emergence of progressively stronger AI systems will influence the wider world (Brundage et al., 2018; Amodei et al., 2016; Bostrom and Yudkowsky, 2014; Bostrom, 2014; Critch and Krueger, 2020), and we believe that they merit serious thought. In this section we discuss the legal ramifications of the Pile, and then consider the impact of the Pile to AI alignment from two angles: accelerating AI timelines and the dangers posed by unaligned language models.
While the machine learning community has begun to discuss the issue of the legality of training models on copyright data, there is little acknowledgment of the fact that the processing and distribution of data owned by others may also be a violation of copyright law. As a step in that direction, we discuss the reasons we believe that our use of copyright data is in compliance with US copyright law.¹⁶
Table 5: Types of consent for each dataset
Under pre (1984) (and affirmed in subsequent rulings such as aff (2013); Google (2015)), non-commercial, not-for-profit use of copyright media is preemptively fair use. Additionally, our use is transformative, in the sense that the original form of the data is ineffective for our purposes and our form of the data is ineffective for the purposes of the original documents. Although we use the full text of copyright works, this is not necessarily disqualifying when the full work is necessary (ful, 2003). In our case, the long-term dependencies in natural language require that the full text be used in order to produce the best results (Dai et al., 2019; Rae et al., 2019; Henighan et al., 2020; Liu et al., 2018).
Copyright law varies by country, and there may be additional restrictions on some of these works in particular jurisdictions. To enable easier compliance with local laws, the Pile reproduction code is available and can be used to exclude certain components of the Pile which are inappropriate for the user. Unfortunately, we do not have the metadata necessary to determine exactly which texts are copyrighted, and so this can only be undertaken at the component level. Thus, this should be taken to be a heuristic rather than a precise determination.
There is serious concern that AI systems may soon be meaningfully more capable than humans in all relevant economic tasks (Grace et al., 2018; Yudkowsky, 2013). Relatedly, there are serious unresolved questions surrounding how to properly align such powerful AI systems with human interests (Bostrom and Yudkowsky, 2014; Russell, 2019; Bostrom, 2014; Amodei et al., 2016) and generally avoid morally catastrophic outcomes (Sotala and Gloor, 2017; Shulman and Bostrom, 2020). As such, it has been argued that accelerating the development of such powerful AI systems may be undesirable before these concerns have been more adequately addressed (Bostrom, 2014).
There are several pragmatic responses to this view:
Due to human competition, curiosity, and cultural diversity, halting technological development is incredibly difficult, if not impossible (Russell, 2019; Critch and Krueger, 2020).
AI development is experimental in nature: The alignment problem can only be solved through development, testing and (hopefully non-existential) failure.
High powered language models, along with their more general successors, must be capable of viewing morally problematic content without adopting it in their output. We elaborate on this in the following section.
With this in mind, we accept the reality that the Pile could potentially accelerate AI timelines. However, we hope our efforts to establish best practices, such as thoroughly documenting the contents of our data, will help encourage diligence for downstream researchers on alignment problems.
16 This discussion does not, and is not intended to, constitute legal advice; rather, it is a general discussion of law. Only your attorney can provide assurances that the information contained herein is applicable or appropriate to a particular situation. If in doubt, it is always advisable to speak to an intellectual property attorney.
There has been much discussion about the possible negative effects of powerful language models in the world (Brown et al., 2020; Brundage et al., 2018). Some of these possible problems, such as the ability to mass produce low quality content for the purpose of Search Engine Optimization, are inherent problems to the way online content is distributed, and cannot be stopped by those developing language models alone. Directly solving these problems would require sweeping changes to the architecture of the Internet, such as vastly expanded Public Key Infrastructure and distributed authentication of identity (Ferguson and Schneier, 2003).
Another concern is that training such models on huge datasets will almost inevitably require them to have undesirable content in their training sets, such as that promoting hateful stereotypes (Christian, 2020). Having models output undesirable content is, by definition, undesirable, but we believe that attacking this problem from the training set side is unproductive and ultimately leads us away from optimal solutions. If a person reads a racist piece of content, they do not then immediately adopt its racist views; they may be capable of doing so, but can decide not to. This capacity to understand undesirable content and then decide to ignore it is an essential future research direction. Not only would this allow models to use “dirtier” data with less concern, but also to use their gained knowledge to better understand what not to do. We recognize that, despite recent progress in human-guided learning (Stiennon et al., 2020), the technology is not yet at this stage, and have thus made a number of editorial decisions as described in this paper. However, this approach seems essential to the future of these models and AI more broadly, and more research is needed.
Self-supervised training of natural language processing models on large, unlabeled text corpora has seen widespread adoption in the field. Word representation models such as GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013) were trained on datasets such as Wikipedia, Gigaword (Graff et al., 2003), or a non-public Google News corpus. More recently, language models (Radford et al., 2018, 2019; Brown et al., 2020; Rosset, 2019; Shoeybi et al., 2019) and masked language models (Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2019) have been trained on datasets such as Wikipedia, BookCorpus (Zhu et al., 2015), RealNews (Zellers et al., 2019), CC-Stories (Trinh and Le, 2018), and other Internet scrape-derived datasets discussed below. Other datasets such as WikiText (Stephen et al., 2016) have also been used in similar self-supervised training.
As data requirements for language modeling have grown, the field has turned towards Internet scrapes for large-scale datasets (Gokaslan and Cohen, 2019), with Common Crawl being particularly prevalent. Works such as Brown et al. (2020); Wenzek et al. (2019); Suárez et al. (2019b); Raffel et al. (2019) have relied on Common Crawl to build training datasets for large-scale models. However, these works often highlight the difficulty of cleaning and filtering the Common Crawl data, and often highlight the resulting data quality as a determining factor of model capability.
It has also been increasingly common practice to combine multiple datasets when training language models. For instance, GPT (Radford et al., 2018) was trained on Wikipedia and BookCorpus, whereas GPT-3 (Brown et al., 2020) was trained on Wikipedia, two fiction datasets, and two web-scraped datasets. The Pile continues the trend of combining large-scale web-scrapes with smaller, higher-quality datasets that capture knowledge we believe would be most beneficial to training language models.
The two most comparable publicly available datasets to the Pile are CC-100 (Wenzek et al., 2019) and C4/mC4 (Raffel et al., 2019). C4 is comparably-sized to the Pile, while mC4 and CC-100 are larger, multilingual datasets. However, C4/mC4 require immense computational resources to preprocess the data, with its maintainers even recommending the use of a distributed cloud service,¹⁷ setting a high bar of entry to using these datasets. CC-100 is directly downloadable and pre-cleaned; however, its English portion is much smaller than the Pile. Importantly, these three datasets are all derived entirely from Common Crawl. As discussed above, the current best practice in training large-scale language models involves using both large web scrapes and more targeted, higher-quality datasets, which the Pile directly addresses.
All authors contributed to the design of the research project and the writing of the paper. Additionally, authors contributed as follows:
Leo Gao led the project, implemented the main Pile codebase, contributed to the model training code, performed the evaluations and the language analysis, interpreted the perplexity analysis results, implemented the processing to create the final data, and processed Pile-CC, PubMed Central, ArXiv, and Ubuntu IRC.
Stella Biderman led the data analysis, the broader impact analysis, and the data documentation, and coordinated the project. She also wrote the analysis of structural statistics, authorial consent, and copyright law.
Sid Black implemented the model training and evaluation code and processed YouTube Subtitles, Stack Exchange, and GitHub.
Laurence Golding implemented deduplication, performed the n-gram analysis, and processed OpenWebText2.
Travis Hoppe processed FreeLaw, Pubmed Abstracts, ExPorter, and PhilPapers.
Charles Foster performed the topic modeling analysis, contributed to the discussion of authorial consent, and processed USPTO Backgrounds.
Jason Phang implemented and performed the GPT-2/3 perplexity analysis and advised the project.
Horace He performed the bias and sentiment analysis.
Anish Thite implemented and performed the profanity analysis and processed Hacker News.
Noa Nabeshima processed GitHub.
Shawn Presser processed BookCorpus2.
Connor Leahy wrote the alignment implication analysis and the model training code.
In the course of building the Pile, we considered including and ultimately decided to not use several datasets. We excluded several datasets on the grounds that they were too small to be worth spending time on or because the English component of the data did not merit inclusion on its own. However, we also decided to exclude several datasets for other reasons, which we document here for transparency:
US Congressional Record. The official record of the United States Congress (1800 – today) records important points of debate at the highest levels of American government. It reflects the opinions and biases of the political class over the past 200 years, including segregationism and xenophobia. In particular, we found a large quantity of extremely racist content that we did not feel appropriate for a dataset intended for general-purpose language modeling.
Fanfiction. Hundreds of GiB of fanfiction has been written and put online, primarily on the websites www.fanfiction.net and https://archiveofourown.org/. This represents a significant untapped resource for language modeling as it is almost exclusively short-form fiction, a writing style that is not represented in most language modeling datasets. We ultimately decided to exclude fanfiction on logistical grounds: we found other sources of data that were easier to obtain.
Literotica. Literotica is a website where users can upload short-form erotic fiction. We had originally planned on including it in the Pile and even went as far as scraping and processing it. However, we decided to not include it for several reasons. Firstly, once we decided to exclude fanfiction, Literotica represented our sole source of short-form fiction, which would likely lead to undesirable biases in the trained model. Secondly, Literotica would require significantly more investigation, assessment, and care than we spent on the other datasets. Thirdly, Literotica contains a significant amount of stereotyping, including racial fetishes. While Literotica is likely usable for some tasks, we are not comfortable including it in the Pile.
This section contains additional information about each dataset listed in Section 2, including how it was obtained, how it was processed, and any other details relevant for replication. The intent of this section is to provide as much detail as possible, so that the Pile can be replicated in the future if necessary, and so that any future processing of these and similar datasets can use or improve on our methods. As such, all code created for processing has been made publicly available under permissive open source licenses and is referenced in footnotes where applicable.
We extract Common Crawl using jusText (Endrédy and Novák, 2013). Our filtering implementation uses a classifier trained against the OpenWebText2 dataset. We process only a small fraction of the available Common Crawl data; we break the list of urls to individual WARC files from 2013 to 2020 into 3679 chunks and process 22 random chunks.
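The chunk sampling described above can be sketched as follows. Only the counts (3679 chunks, 22 sampled) come from the text; the contiguous near-equal split, the fixed seed, and the placeholder file names are assumptions made for illustration.

```python
import random


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


# Hypothetical placeholders standing in for the real list of WARC file URLs.
warc_urls = [f"warc-file-{i:06d}" for i in range(10_000)]

chunks = split_into_chunks(warc_urls, 3679)
rng = random.Random(0)  # fixed seed is an assumption, used here for reproducibility
selected = rng.sample(range(len(chunks)), 22)
to_process = [url for idx in selected for url in chunks[idx]]
```

Sampling whole chunks rather than individual files keeps the bookkeeping simple: recording the 22 chunk indices is enough to reproduce the subset.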
CommonCrawl data is available in two main formats: Web ARChive (WARC) files, which contain a full record of the crawl as well as the raw HTML of the webpage, and WET files, which contain pre-extracted versions of the contents of the WARC files. The WET files have poor quality, often containing large amounts of boilerplate text like menus and page footers, but due to the lower bandwidth and computation requirements necessary to use WET files, prior work based on CC has mainly focused on using WET files while applying cleaning such as document level filtering (Brown et al., 2020; Wenzek et al., 2019), or n-sentence level deduplication with very aggressive heuristics (Raffel et al., 2019).
We do not believe that document level filtering is sufficient for WET files because many of the issues with WET files stem from intra-document boilerplate. We also find many of the heuristics used in Raffel et al. (2019), such as the removal of all lines without terminal punctuation, the word “javascript”, and 3-sentence deduplication to be too aggressive.
In addition to jusText, we also considered Trafilatura, Newspaper, Goose3, and DragNet. While we were originally intending on creating an extraction benchmark, this proved infeasible given our available resources, and we chose jusText based on visual inspection of the output. In inspection, we noticed that jusText has the characteristic that it discards more data than many other extractors, which is not a major drawback given the large volume of CC data available. This was as expected, given jusText’s intended application for text corpora creation. In contrast, trafilatura is, for instance, better at preserving the structure of the website faithfully, often correctly extracting elements such as tables, but it kept too much unnecessary boilerplate. Had we used trafilatura, we would have required an additional intra-page filtering step to remove boilerplate from the page.
While jusText does technically support several other languages, the quality on those languages is worse than on English as many constants in the algorithm are specifically tuned for English. Additionally, jusText is completely unable to handle languages such as Chinese and Japanese, which do not use spaces to delimit words.
Due to the difficulty of maintaining an acceptable level of extraction quality across all languages, we decided to restrict the scope of the CC dataset to only English and leave a high-quality, fully multilingual, WARC-based CC-based dataset to future work. To filter for only English, we use the pycld2 library and only attempt to extract text from documents where English is the most common language.
We use pycld2 instead of fasttext because it is capable of classifying the language from the HTML directly, and since jusText requires knowledge of the language of the webpage before extraction. Additionally, pycld2 was significantly faster than jusText, and by only processing with jusText documents classified as English by pycld2, we reduced the required computation by approximately half.
Extracting text from websites for language modeling, especially for multilingual corpora, is highly nontrivial, and we leave the refinement of such extraction to future work.
To filter CC for quality, we follow Brown et al. (2020) in training a classifier to classify between a known high quality dataset and CC. We use fasttext with an n-gram size of 2. We ran experiments using both the entire Pile and just OpenWebText2 as the positive examples, with score distributions on unseen CC data as shown in Figure 9. We decided to use only OpenWebText2 for positive examples for our final CC data because of the low sensitivity
We use pandoc 1.19.2.4 (MacFarlane, 2006–2020) to convert the JATS format data provided by PMC to markdown. Afterwards, we remove any line beginning with :::, which is used by pandoc to indicate html classes in markdown.
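The post-processing step above amounts to a simple line filter. A minimal sketch, assuming the marker appears at the very start of the line with no leading whitespace (the function name and sample text are hypothetical):

```python
def strip_pandoc_div_lines(markdown: str) -> str:
    """Drop every line that begins with ':::' and keep all other lines unchanged."""
    kept = [line for line in markdown.splitlines() if not line.startswith(":::")]
    return "\n".join(kept)


# Hypothetical pandoc output containing class-marker lines.
sample = "::: {.abstract}\nWe study X.\n:::\n\nResults follow."
cleaned = strip_pandoc_div_lines(sample)
```

Only the marker lines are dropped; the text they wrapped is kept as ordinary markdown.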
No additional details.
To produce the dataset, URLs and their associated metadata were first extracted from all Reddit submissions up to April 2020. URLs were deduplicated, with each unique URL featuring a list of associated submissions metadata, and an aggregate score. URLs with an aggregate score of less than 3 were removed. The links were then scraped and processed with the Newspaper scraper. Deduplication was performed at the document level using in-memory MinHashLSH through the DataSketch library.
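The URL-level aggregation and score filter can be sketched as below. The text does not say how the aggregate score is computed, so summing the scores of all submissions sharing a URL is an assumption, as are the field names `id`, `url`, and `score`. The scraping and MinHashLSH document deduplication steps are not covered.

```python
from collections import defaultdict


def aggregate_submissions(submissions, min_score=3):
    """Group submissions by URL, sum their scores, and drop URLs scoring below min_score."""
    by_url = defaultdict(lambda: {"score": 0, "submissions": []})
    for sub in submissions:
        entry = by_url[sub["url"]]
        entry["score"] += sub["score"]  # assumed aggregation: sum of scores
        entry["submissions"].append(sub["id"])
    return {url: entry for url, entry in by_url.items() if entry["score"] >= min_score}


# Hypothetical submission records.
submissions = [
    {"id": "a1", "url": "https://example.com/x", "score": 2},
    {"id": "b2", "url": "https://example.com/x", "score": 2},
    {"id": "c3", "url": "https://example.com/y", "score": 1},
]
kept = aggregate_submissions(submissions)
```

Under this assumption a link posted several times with low individual scores can still pass the threshold, since the filter is applied after aggregation.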
Both filtered and raw versions were produced, with the raw version only deduplicated by URL. The filtered version contains 65.86 GB of uncompressed text across 17,103,059 documents. The raw version is much larger, at 193.89 GB of uncompressed text across 69,547,149 documents.
We chose to use Newspaper instead of jusText for OpenWebText2 for consistency with OpenWebTextCorpus. Additionally, by using multiple different HTML extractors for different components of the Pile, we reduce the potential impact of systematic biases from any one extractor negatively impacting the dataset.
We downloaded the TEX sources of papers up to the 2020 dump (the last file included in our data is arXiv_src_2007_068.tar) via arXiv’s S3 Bulk Source File Access¹⁸, and used pandoc 1.19.2.4 to convert these source files to Markdown, discarding any papers which had errors during the conversion process. This yielded a total of 1,264,405 papers.
We remove any line beginning with :::, which is used by pandoc to indicate html classes in markdown.
We separate the data gathering process into two steps:
Gathering a list of the desired repositories and their metadata
Extracting all text data useful for language modeling from each repository
For the first step, mirroring the approach of the WebText dataset, we use GitHub ‘stars’ as a proxy for quality, and choose to gather only repositories with more than 100 stars. For practical reasons, we also limit the list of repositories gathered to repositories with less than 1GB of files. Since Github’s API limits the number of search results to 1000, in order to comprehensively gather all repositories we need to create many small queries that each return fewer than 1000 results in such a way that every repository of interest will be returned by at least one of our queries. To achieve this, we bound our initial search by size to return only repositories between a lower bound of 0 and an upper bound of 5 bytes. At the time of writing, this returns 965 results. For the next step, we set our lower bound one above our previous upper bound, and decide on a new upper bound that should also return fewer than 1000 results by using the results from our last query to estimate our new upper bound as lowerbound + 1000/(n/r), where n is the number of previous results and r is the range of bounds in the previous step.
This tends not to overshoot, because Github repositories follow a power distribution with respect to size, but if it does, we simply use the number of repositories our new query returned in order to construct a new upper bound estimate.
Using the gathered list of repositories, we clone each one, extract any text-based files, and discard the rest. Because some repositories took an impractical amount of time to clone and/or extract, we set a hard time limit of 300 seconds for both the git cloning and text extraction steps. As such, some larger repositories may only be partially extracted. We also impose a file size limit of 100kB on extracted files, as we found that the majority of files over that size were typically very repetitive auto-generated source files or data files, and that setting this file size limit was an effective cleaning step to limit the data to code.
Because we wanted to limit the size of the overall Pile, we randomly sampled 95.0 GiB of the 630.64 GiB of Github data we collected in total and leave quality filtering to future work.
However, we believe code generation will be an increasingly important component of language models as they continue to scale up and increase in their ability to generalize. As such, we hope to extend this dataset in future work.
(a) OpenWebText2
(b) Full Pile
Figure 9: Score distribution of documents from Common Crawl given different classifier training data.
Figure 10: Left: number of new submissions/year to arXiv grouped by domain over time. Right: fractional submission rates for each of the domains. Figure from https://arxiv.org/help/stats/2019_by_area/
We download the court opinions data in bulk from CourtListener,¹⁹ and extract the raw text using BeautifulSoup.
To construct the dataset, we download and parse every Stack Exchange database dump to plaintext files. We opt to extract the top three answers with at least three upvotes, discarding all other responses. We only include the plain text question and response and do not incorporate any metadata. Motivated by large-scale language models’ few-shot ability (Brown et al., 2020), we provide context by prepending all questions and answers with Q:\n\n and A:\n\n respectively.
The resulting dataset contains a total of 15,622,475 documents across a total of 365 Stack Exchanges and Meta-Stack Exchanges, the bulk of which is from StackOverflow.
The United States Patent and Trademark Office (USPTO) has published bulk archives of the full text of all patents granted in the US from 1976 to September 2020. From these archives, we extract the Background sections, along with key grant-specific metadata, such as the inventor, assignee, and classification information.
The file format used for storing bulk text US patents has changed over time. Prior to 2002, all of the datasets are in a specialized format called APS (Automated Patent System). Since 2002, the data is XML encoded. Partially as a function of this change, the location of the “Background” section has also shifted. Our converter accounts for these structural shifts and extracts the raw text from each patent’s Background.
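Returning to the Github collection step, the size-bounded search can be illustrated with a minimal sketch. It assumes a hypothetical callable `count(lo, hi)` that reports how many repositories have a size in the inclusive window [lo, hi]; here a fake in-memory list stands in for the search API. Treating r as the number of integer sizes in the previous window, and widening the window when a query returns nothing, are assumptions of this sketch rather than details given in the text.

```python
from bisect import bisect_left, bisect_right
from typing import Callable, List, Tuple

MAX_RESULTS = 1000  # per-query result cap described in the text


def estimate_upper(lower: int, n: int, r: int) -> int:
    """New upper bound estimate: lower + 1000 / (n / r)."""
    if n == 0:
        return lower + 2 * r  # assumption: widen the window if nothing was found
    return lower + (MAX_RESULTS * r) // n


def plan_windows(count: Callable[[int, int], int], max_size: int,
                 first_upper: int = 5) -> List[Tuple[int, int, int]]:
    """Return (lower, upper, n) windows that cover sizes 0..max_size."""
    windows = []
    lower, upper = 0, min(first_upper, max_size)
    while True:
        n = count(lower, upper)
        # Overshoot: re-estimate from the count the oversized query returned.
        while n >= MAX_RESULTS and upper > lower:
            r = upper - lower + 1
            upper = min(upper - 1, estimate_upper(lower, n, r))
            n = count(lower, upper)
        windows.append((lower, upper, n))
        if upper >= max_size:
            return windows
        r = upper - lower + 1
        lower = upper + 1  # one above the previous upper bound
        upper = min(max_size, estimate_upper(lower, n, r))


# Fake data standing in for the search API: three repositories of every size.
SIZES = sorted(i // 3 for i in range(30000))


def fake_count(lo: int, hi: int) -> int:
    return bisect_right(SIZES, hi) - bisect_left(SIZES, lo)


windows = plan_windows(fake_count, max_size=SIZES[-1])
```

Each window in `windows` stays under the result cap, and consecutive windows are contiguous, so every repository is returned by exactly one query in this toy setting.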
About one-third of the articles in the dataset had a missing or malformed title or abstract and were excluded. Additionally, PubMed Central (see Section 2.2) contains full-text resources for many recent publications; any publications which already appear in PMC are excluded from this set. To process the data, we concatenated the title and abstract and removed any copyright information. The remaining dataset contains 15,518,009 titles and abstracts.
No additional details.
To create the text dataset, we simply extract the subtitle text from each XML file in the English language dataset provided by Tiedemann (2016), discarding any provided metadata.
We use the wikipedia/20200301.en dataset from TensorFlow Datasets.20 We prepend the title to the body of each article, separated by two newlines.
We include instances from the Easy, Medium, and Hard components of DeepMind Mathematics, breaking each curriculum item (such as algebra__polynomial_roots) into 8 KiB chunks.
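As an illustration of the 8 KiB chunking, the following sketch greedily packs whole lines into chunks of at most 8192 UTF-8 bytes. Splitting on line boundaries is an assumption of this sketch; the text only states the chunk size. A single line longer than the limit would still be emitted as its own oversized chunk.

```python
from typing import Iterable, List

CHUNK_BYTES = 8 * 1024  # 8 KiB, as stated in the text


def chunk_lines(lines: Iterable[str], limit: int = CHUNK_BYTES) -> List[str]:
    """Greedily pack whole lines into chunks of at most `limit` UTF-8 bytes."""
    chunks, current, size = [], [], 0
    for line in lines:
        n = len(line.encode("utf-8"))
        if current and size + n > limit:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks


# Hypothetical question/answer lines standing in for one curriculum item.
lines = [f"What is {i} + {i}?\n{2 * i}\n" for i in range(2000)]
chunks = chunk_lines(lines)
```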
20 https://www.tensorflow.org/datasets/catalog/wikipedia#wikipedia20200301en
We processed all logs from July 5, 2004 through September 1, 2020.
To process the data, all system messages, such as joins, disconnects, nick changes, etc. were discarded, but actions (i.e., using /me) were kept. Timestamps were removed, and all logs for the same channel in a given week were concatenated into a single document, with the logs for each day prepended with the date if that day’s log is non-empty.
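The processing steps above can be sketched as follows. The log line format used here (a leading "[HH:MM]" timestamp, system messages starting with "===", and actions starting with "*") is an assumption made for illustration, as is grouping days by ISO calendar week.

```python
import re
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

TIMESTAMP = re.compile(r"^\[\d{2}:\d{2}\]\s*")  # assumed timestamp format


def clean_day(lines: List[str]) -> List[str]:
    """Drop system messages and strip timestamps; actions are kept."""
    kept = []
    for line in lines:
        if line.startswith("==="):  # assumed marker for joins, quits, nick changes
            continue
        text = TIMESTAMP.sub("", line).rstrip()
        if text:
            kept.append(text)
    return kept


def weekly_documents(logs: Dict[Tuple[str, date], List[str]]) -> Dict[tuple, str]:
    """Concatenate each channel's daily logs into one document per week."""
    weeks = defaultdict(list)
    for (channel, day), lines in sorted(logs.items()):
        cleaned = clean_day(lines)
        if cleaned:  # only non-empty days get a date header
            iso = day.isocalendar()
            weeks[(channel, iso[0], iso[1])].append(
                day.isoformat() + "\n" + "\n".join(cleaned))
    return {key: "\n".join(parts) for key, parts in weeks.items()}


logs = {
    ("#ubuntu", date(2020, 8, 31)): ["=== bob has joined #ubuntu",
                                     "[10:00] <alice> hi",
                                     "[10:01]  * alice waves"],
    ("#ubuntu", date(2020, 9, 1)): ["=== bob has quit"],
    ("#ubuntu", date(2020, 9, 2)): ["[09:00] <bob> back"],
}
docs = weekly_documents(logs)
```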
The original BookCorpus consists of 11,038 books. However, due to issues with availability of the original BookCorpus, as well as the possibility of collecting a larger version, we decided to collect our own version of BookCorpus using a similar methodology as Kobayashi (2018). Our version of BookCorpus contains 17,868 books instead.
We create and use a modified version of the epub-to-text converter in Kobayashi (2018) that:
• Correctly preserves the document structure across chapters, matching the table of contents very closely;
• Correctly renders tables of data, whereas by default html2txt produces poor-quality results for tables;
• Correctly preserves code structure, so that source code is visually coherent;
• Converts numbered lists from “1\.” to “1.”;
• Runs the full text through ftfy.fix_text() (Speer, 2019), replacing Unicode apostrophes with ascii apostrophes and expanding Unicode ellipses to “...” (three separate ascii characters).
We download the data in bulk from the Europarl website.21 We remove all basic tag information and only retain the name of each document as a title. For example,
21 http://www.statmt.org/europarl/
Educational topics.
We first use the Hackernews BigQuery dataset to obtain a list of all story ids in our date range. For the Pile we use the first Hacker News post (1) to post number 24531712. This corresponds to a date range of approximately 10/09/2006 to 09/20/2020. We use the BigQuery dataset to gather story ids for efficiency purposes. However, the BigQuery dataset was lacking some information for stories, so we used the official Hacker News API for story and comment text retrieval.
Hacker News displays and stores comments in a tree-like manner, with children comments replying to parent comments. However, most language models require input data to be in a sequential form. Considering each path through the comment tree as a sequence could be detrimental, since there will be a large amount of near-duplicate comment sequences. In addition, only taking one path through the comment tree for each story leaves out a large portion of the comment data. Therefore, we parsed comments in a hybrid form. For every top-level comment (comments that have no parent comment), we create a sequence of comments by traversing down the comment tree from the top-level comment. We choose the next comment by taking the child comment with the highest number of children comments (a cheap attempt at taking a long path through the comment tree; note that it does not take the longest possible path).
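A minimal sketch of this traversal is given below. The `children` mapping from an item id to its list of child comment ids is a hypothetical in-memory structure, not the Hacker News API format, and breaking ties by taking the first child is an assumption.

```python
from typing import Dict, List

Tree = Dict[int, List[int]]  # item id -> ids of its direct child comments


def comment_chain(top_id: int, children: Tree) -> List[int]:
    """Follow, at each step, the child with the most direct children."""
    chain = [top_id]
    current = top_id
    while children.get(current):
        current = max(children[current],
                      key=lambda c: len(children.get(c, [])))
        chain.append(current)
    return chain


def story_chains(story_id: int, children: Tree) -> List[List[int]]:
    """One chain per top-level comment of the story."""
    return [comment_chain(c, children) for c in children.get(story_id, [])]


# Story 0 has top-level comments 1 and 2.
children = {0: [1, 2], 1: [3, 4], 3: [5], 5: [8], 4: [6, 7]}
chains = story_chains(0, children)
```

In this example the chain from comment 1 goes through comment 4, which has two children, even though the path through comment 3 is longer. This is the behavior the text describes: a cheap heuristic, not a longest-path search.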
We consider all stories that have at least one comment and are not flagged by the moderators for potential conduct violations. Since comments are stored in HTML, we use the html2text package to extract the text from the post.
We order each document by listing the title, url, sub-title, and author at the top. Top-level comments are delimited by “\n----\n” and sub-comment chains are delimited by “\n~~~\n”. We include author and extracted text for each comment.
C.19 YouTube Subtitles
We construct the dataset in three stages:
1. We build a large list of search terms by prompting a GPT-3 model with a manually selected list of queries, manually filtering the responses, and repeating this process iteratively until a suitable size is reached. The list of terms is centred around, but not limited to, educational topics.
2. We use requests-html to gather a list of 1000 Youtube video IDs for each search term, and deduplicate the resulting video ids across search terms.
3. We use YoutubeTranscriptApi22 to gather all human generated closed captions for every available language for each video. To align each language in parallel, we split the captions for each language into parallel minute-long sections by timestamp, and arrange each language in a random order within these sections, appending the language as a header to each minute-long section to provide context. If only a single language is available, the output is just the subtitles, with no header appended.
In total, subtitles for 173,651 videos were gathered.
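The minute-level alignment in the third stage can be sketched as follows. The input structure (a mapping from language code to a list of (start_seconds, text) pairs) and the "[lang]" header format are assumptions made for illustration; the text does not specify either.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Captions = Dict[str, List[Tuple[float, str]]]  # language -> (start seconds, text)


def align_captions(captions: Captions, rng: random.Random) -> str:
    """Group captions into minute-long sections with languages in random order."""
    if len(captions) == 1:
        # Single language: plain subtitles, no header.
        (segments,) = captions.values()
        return "\n".join(text for _, text in sorted(segments))
    minutes = defaultdict(lambda: defaultdict(list))
    for lang, segments in captions.items():
        for start, text in sorted(segments):
            minutes[int(start // 60)][lang].append(text)
    sections = []
    for minute in sorted(minutes):
        langs = list(minutes[minute])
        rng.shuffle(langs)  # random language order within each minute
        for lang in langs:
            sections.append(f"[{lang}]\n" + "\n".join(minutes[minute][lang]))
    return "\n\n".join(sections)


captions = {
    "en": [(0.0, "hello"), (61.0, "bye")],
    "de": [(2.0, "hallo"), (65.0, "tschuess")],
}
aligned = align_captions(captions, random.Random(0))
single = align_captions({"en": captions["en"]}, random.Random(0))
```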
The PhilPapers (PP) are indexed using OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting. As such, the first step to collect the data is to get the XML for all links. This was done using pyoaiharvester.23
From that, each publication is downloaded. Some entries do not exist, or have been removed by the authors. Papers with text are extracted using pdfbox, and papers with non-machine readable text are ignored. Non-English language publications are kept, and the metadata reflects the language reported by the OAI-PMH XML. The text is filtered with pdf_filter.py from PDFextract, and we discard any papers with fewer than 1000 characters.24
The NIH provides a bulk-data repository for awarded applications through the ExPORTER service covering the fiscal years 1985–present. These data come from the NIH, but also from other Health and Human Services agencies (ACF, AHRQ, CDC, HRSA, FDA), and the VA. Additionally, the NIH provides a legacy data format named CRISP for awarded applications during the fiscal years 1970–2009.
22 https://github.com/jdepoix/youtube-transcript-api
23 https://github.com/vphill/pyoaiharvester/
24 https://github.com/sdtblck/PDFextract
We merged both the ExPORTER and CRISP data to form a consolidated dataset of awarded applications. Entries were deduplicated based on their application ID, and excluded if their abstract text was missing or too short. Small grants, especially administrative ones, consisted solely of short boilerplate. For this reason, we further deduplicated on abstract text. All grant types were considered, including new applications (Application Type Code 1) and renewals (Application Type Code 2), as the text differed enough to provide novel input. The text was then minimally parsed to remove administrative boilerplate (e.g., most old awards contain some variation of “description: (provided by applicant)”). In total, there were 939,668 grant application abstracts added.
To extract the data, we used the mailparser package25 to extract the body of each email as a document.
This section discusses any processes applied across multiple datasets.
To combine the constituent datasets, we iterate until the size of the output dataset is the desired size, drawing documents from datasets at random, weighted by the number of documents in each dataset times the number of epochs desired on that dataset. Because the number of documents involved is high, by the law of large numbers, the number of copies of each dataset present in the Pile is approximately equal to its epoch count.
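A minimal sketch of this weighted draw follows. For simplicity it measures the target size in documents rather than bytes, and it samples documents with replacement from in-memory lists; both are assumptions of the sketch. The dataset names and sizes are invented for illustration.

```python
import random
from typing import Dict, List, Tuple


def mix(datasets: Dict[str, List[str]], epochs: Dict[str, float],
        target_docs: int, rng: random.Random) -> List[Tuple[str, str]]:
    """Draw documents, choosing a dataset with weight = n_documents * epochs."""
    names = list(datasets)
    weights = [len(datasets[name]) * epochs[name] for name in names]
    output = []
    while len(output) < target_docs:
        name = rng.choices(names, weights=weights, k=1)[0]
        output.append((name, rng.choice(datasets[name])))
    return output


datasets = {
    "small": [f"s{i}" for i in range(50)],
    "large": [f"l{i}" for i in range(100)],
}
epochs = {"small": 2.0, "large": 1.0}  # both weights come out to 100
mixed = mix(datasets, epochs, target_docs=20000, rng=random.Random(0))
share_small = sum(name == "small" for name, _ in mixed) / len(mixed)
```

With 50 documents at 2 epochs and 100 documents at 1 epoch, both datasets carry the same weight, so `share_small` comes out close to one half.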
Shuffling a dataset posed a major problem due to our limited memory and computational budget. We follow Hardin (2018), a method descended from Rao (1961), and interleave our output to produce 30 output piles.
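One common way to shuffle data that does not fit in memory is a two-pass scheme: stream the input once, scattering each item to a randomly chosen pile, then shuffle each pile separately. The sketch below shows that generic scheme with in-memory lists standing in for on-disk piles; whether it matches the cited method in every detail is an assumption.

```python
import random
from typing import Iterable, List


def shuffle_into_piles(items: Iterable[int], num_piles: int,
                       rng: random.Random) -> List[List[int]]:
    """Two-pass shuffle: scatter items to random piles, then shuffle each pile."""
    piles: List[List[int]] = [[] for _ in range(num_piles)]
    for item in items:  # pass 1: one streaming read of the input
        piles[rng.randrange(num_piles)].append(item)
    for pile in piles:  # pass 2: each pile is small enough to shuffle in memory
        rng.shuffle(pile)
    return piles


piles = shuffle_into_piles(range(1000), num_piles=30, rng=random.Random(0))
```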
We hold out approximately 10GiB of data from the Pile, of which 2GiB are used to create the validation and test splits, and the remainder is held in reserve. From the training set, we remove any elements that are also present verbatim in any of the held out data, to prevent leakage.
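Verbatim removal of this kind can be sketched with a set of document fingerprints. Using SHA-256 digests instead of storing the full held-out texts is a choice made here to keep memory bounded, not something the text specifies.

```python
import hashlib
from typing import Iterable, List


def fingerprint(doc: str) -> bytes:
    """Digest used to compare documents for exact (verbatim) equality."""
    return hashlib.sha256(doc.encode("utf-8")).digest()


def drop_heldout(train: Iterable[str], heldout: Iterable[str]) -> List[str]:
    """Remove training documents that also appear verbatim in the held-out data."""
    held = {fingerprint(doc) for doc in heldout}
    return [doc for doc in train if fingerprint(doc) not in held]


train = ["alpha", "beta", "gamma", "beta"]
clean = drop_heldout(train, heldout=["beta"])
```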
Similar to Brown et al. (2020), we increase the weight of certain components such that the number of epochs elapsed on data we consider high quality is greater than one. Our choice of weights was primarily informed by the source of the data and the size of the dataset; we attempted to upweight academic texts the most, which we felt provided the highest quality data, as well as smaller sets, such that they would have a more pronounced impact on the data. We strictly disallowed more than 3 epochs on any data and generally avoided more than 2 epochs.
Due to memory constraints we did not perform Pile-wide de-duplication. Instead, de-duplication was performed at the document level within OpenWebText2 and Pile-CC as those sets were the most likely to contain duplicate documents.
The same technique was used for both OpenWebText2 and Common Crawl: MinHashLSH with the Python Datasketch library.26 We used 10 hash functions for each Minhash and an approximate Jaccard similarity of 0.5. This produced a duplicate rate of 28% in OpenWebText2 and 26% for Common Crawl.
The main challenge here was computational, leading us on a journey through the various LSH persistence options. A simple quadratic Minhash comparison of all documents would have taken several hundred thousand years, motivating the use of LSH. Initially, we did not have sufficient RAM for in-memory LSH and chose to use the Cassandra backend when de-duplicating OpenWebText2. This was reasonably fast, but the same method resulted in a corrupted database about 3/4 of the way through processing Common Crawl. After the Cassandra corruption, we briefly tested the experimental Mongo implementation; however this was quite slow due to the nature of Mongo itself. In the end, we ran in-memory LSH on a machine with enough RAM for Common Crawl, taking several days.
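To illustrate the idea behind MinHash with locality-sensitive hashing, the following is a self-contained sketch written with the standard library only. It is not the Datasketch implementation. The 10 hash functions and the 0.5 threshold follow the text, while the word 3-gram shingles, the SHA-1 based hash family, and the 5 x 2 banding are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from typing import List, Set, Tuple

NUM_HASHES = 10     # hash functions per MinHash, as in the text
THRESHOLD = 0.5     # approximate Jaccard similarity, as in the text
BANDS, ROWS = 5, 2  # assumption: one banding with BANDS * ROWS == NUM_HASHES


def shingles(text: str, k: int = 3) -> Set[str]:
    """Set of word k-grams (assumed shingling scheme)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(text: str) -> Tuple[int, ...]:
    """Signature: the minimum hash of the shingle set under each seeded hash."""
    grams = shingles(text)
    return tuple(
        min(int.from_bytes(
            hashlib.sha1(f"{seed}:{g}".encode("utf-8")).digest()[:8], "big")
            for g in grams)
        for seed in range(NUM_HASHES))


def similarity(a: Tuple[int, ...], b: Tuple[int, ...]) -> float:
    """Estimated Jaccard similarity: fraction of matching signature slots."""
    return sum(x == y for x, y in zip(a, b)) / NUM_HASHES


def deduplicate(docs: List[str]) -> List[int]:
    """Return indices of kept documents; later near-duplicates are dropped."""
    buckets = defaultdict(list)
    signatures, kept = {}, []
    for idx, text in enumerate(docs):
        sig = minhash(text)
        keys = [(band, sig[band * ROWS:(band + 1) * ROWS]) for band in range(BANDS)]
        # Only documents sharing at least one band bucket are compared.
        candidates = {other for key in keys for other in buckets.get(key, ())}
        if any(similarity(sig, signatures[c]) >= THRESHOLD for c in candidates):
            continue
        kept.append(idx)
        signatures[idx] = sig
        for key in keys:
            buckets[key].append(idx)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank",
    "stack exchange answers about compiling rust programs on embedded linux boards",
]
kept = deduplicate(docs)
```

The banding step is what avoids the quadratic comparison: a new document is compared only against earlier documents that collide with it in at least one band.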
25 https://github.com/SpamScope/mail-parser
26 https://github.com/ekzhu/datasketch
To avoid leakage of data from downstream evaluations, recent work (Radford et al., 2019; Brown et al., 2020; Shoeybi et al., 2019) has removed any data in the training set that may overlap with the evaluation metrics. We decided not to perform any such removal, because it is impossible to anticipate all potential downstream evaluation metrics, and so any particular selection of metrics would inevitably either become obsolete as the choice of benchmarks in the field changes, or potentially hinder the development of new benchmarks for models trained on Pile.
For models trained on Pile and evaluated on metrics other than Pile’s own validation and test sets, we encourage authors to remove overlaps between Pile and the validation data of these additional downstream evaluations. We do not anticipate that such leakage removal will hurt model performance, as the validation sets of most benchmarks are very small in relation to the size of the Pile, and so choosing to evaluate on more metrics will not be a disadvantage for any model.
As part of our exploratory analysis, we calculated the counts of all 13-grams across Common Crawl. We chose n = 13 due to its use in prior work (Brown et al., 2020). There were a total of 40,216,231,078 different 13-grams in this dataset. The 1000 most common range from 11 million occurrences down to 20k.
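Counting 13-grams can be sketched with a sliding window over tokens. Whitespace tokenization and an in-memory counter are assumptions made for illustration; at the scale described above the counts would not fit in a single in-memory dictionary.

```python
from collections import Counter
from typing import Iterable


def count_ngrams(docs: Iterable[str], n: int = 13) -> Counter:
    """Count every window of n consecutive whitespace-separated tokens."""
    counts: Counter = Counter()
    for doc in docs:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


docs = [
    "* " * 16,                              # styling repetition: 16 tokens
    "a b c d e f g h i j k l m",            # exactly 13 tokens
]
counts = count_ngrams(docs)
```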
The most frequently occurring 13-grams were character repetitions used for styling such as “– –”, “* * * *”, “! !”, at 11 million, 5.8 million and 1.1 million respectively. Other characters used in this manner include the following: “# . > ?”. In the 264k count range, we see repetitions of badly formatted HTML escape characters “;  ”, “; amp”. Boilerplate from standard forum software appears around the 180k occurrences range, such as the following: “select the forum that you want to visit from the selection below”.
Overall, a large amount of common HTML and CSS is included in the top 1000, along with boilerplate text from Amazon Affiliate Advertising, TripAdvisor, SimplyHired, Associated Press, PostMedia, The FCC etc. PHP error messages and password login prompts also made an appearance. It may be of interest to fans of Portal that repetitions of “the cake is a lie .” achieved a high count.
Table 7: Tokens per byte for Pile components (rows: Pile-CC, PubMed Central, Books3, OpenWebText2, Arxiv, Github, FreeLaw, StackExchange, USPTO Backgrounds, PubMed Abstracts, Gutenberg (PG-19), OpenSubtitles, Wikipedia (en), DM Mathematics, Ubuntu IRC, BookCorpus2, EuroParl, HackerNews, YoutubeSubtitles, PhilPapers, NIH ExPorter, Enron Emails).
Computation
To compute the perplexity for a given dataset, we tokenize each document separately, divide the document into segments of up to the maximum sequence length of the model (1024 tokens for GPT-2, 2048 for GPT-3), and predict the logits of each segment. The inputs to the model are the immediately preceding tokens; e.g., for scoring tokens 1 to 1024, we provide tokens 0 to 1023 as the input context. The respective language model implementations handle the causal attention masking. This ensures that every token in the dataset is scored exactly once. This also means that some tokens will have more input context than others. We then aggregate over the whole dataset and compute the final perplexity score. The perplexity for the whole Pile is computed by aggregating over the constituent datasets (i.e., weighted by dataset size, not a simple average of dataset perplexities). Both GPT-2 and GPT-3 share the same tokenizer and vocabulary, making the perplexity scores directly comparable. We use the Hugging Face (Wolf et al., 2020) implementation of GPT-2, and the OpenAI API for GPT-3. The davinci model in the OpenAI API is presumed to correspond to a 175B parameter version of GPT-3.
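The segmentation and aggregation described above can be sketched as follows. The model is abstracted as a hypothetical callable `logprob_fn(inputs, targets)` that returns one log-probability per target token; a uniform toy model stands in for GPT-2 or GPT-3. Treating token 0 of each document as context only (for example, a start-of-text token) is an assumption of this sketch.

```python
import math
from typing import Callable, List, Sequence, Tuple

LogprobFn = Callable[[Sequence[int], Sequence[int]], List[float]]


def score_document(tokens: Sequence[int], logprob_fn: LogprobFn,
                   max_len: int) -> Tuple[float, int]:
    """Sum log-probabilities so that tokens 1..end are each scored once."""
    total, scored = 0.0, 0
    for start in range(0, len(tokens) - 1, max_len):
        targets = tokens[start + 1:start + 1 + max_len]
        inputs = tokens[start:start + len(targets)]  # targets shifted back by one
        logprobs = logprob_fn(inputs, targets)
        total += sum(logprobs)
        scored += len(logprobs)
    return total, scored


def perplexity(documents: Sequence[Sequence[int]], logprob_fn: LogprobFn,
               max_len: int = 1024) -> float:
    """Aggregate over all tokens of all documents (token-weighted)."""
    total, scored = 0.0, 0
    for tokens in documents:
        doc_total, doc_scored = score_document(tokens, logprob_fn, max_len)
        total += doc_total
        scored += doc_scored
    return math.exp(-total / scored)


VOCAB = 50  # toy vocabulary size


def uniform_model(inputs: Sequence[int], targets: Sequence[int]) -> List[float]:
    return [-math.log(VOCAB)] * len(targets)


docs = [list(range(10)), list(range(3))]
ppl = perplexity(docs, uniform_model, max_len=4)
```

Because log-probabilities and token counts are summed before exponentiating, a dataset-level score computed this way is weighted by the number of tokens rather than averaged per document or per dataset. The first target of every segment after the first sees only one token of context here, which is the uneven-context effect the text mentions.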
In Table 8 we show the test set perplexities (i.e., not normalized by UTF-8 length as in Table 2). Because of the costs associated with using the OpenAI API, we compute test perplexities on only one-tenth of the test set in Table 8 and Table 2. Specifically, we randomly sample one-tenth of the documents of each dataset except for three: Ubuntu IRC, BookCorpus2, and PhilPapers. In Table 9, we show test perplexity computed on the full test set on all GPT-2 models.
Figure 11: Test loss (log perplexity) over the Pile, bucketed by position in the input sequence based on the model’s maximum sequence length. To smooth out the lines, we bucket 4 positions per plotted datapoint (e.g. positions 0–3, positions 2044–2047). Later tokens are predicted with more context and thus see lower perplexities.
Initially we decided on separating pejorative content into 4 groups: sex-related terminology, slurs, neither of these categories, and both of these categories. We adapted a public “naughty words” list and broke them into these categories with the intent of looking at the proportion of each category in each dataset. However, this presented many issues.
First, any blacklist of words would be hard-pressed to catch all the instances of pejorative content, since purposeful misspellings of words could evade the censor and still have the intended effect. Furthermore, words and their intents are always evolving, therefore any list created would likely be always outdated. Another issue pertains to sorting the words into the categories. Words are highly dependent on their context, so a word would change categories with different contexts.
Figure 12: Percentage of sentences classified as profane in the Pile. The percentage of the CC component and the weighted mean of the Pile as a whole are shown as horizontal lines.
The following consists of two random, non-cherrypicked 512-byte samples from each constituent dataset of the Pile, sampled from the validation split.
pot trending topics and the coverage around them. First up, there’s a bit of a visual redesign. Previously, clicking on a trending topic would highlight a story from one publication, and you’d have to scroll down past a live video section to view related stories. Facebook is replacing that system with a simple carousel, which does a better job of showing you different coverage options. To be clear, the change doesn’t affect how stories are sourced, according to Facebook. It’s still the same algorithm pickine public safety. He said the bridge saves commuters two or three minutes when trains pass – and those minutes could be vital.
“Two to three minutes may not mean much if you’re just driving home from work, but if you’re the one waiting for an ambulance to get to your home, if you’re the one waiting for a fire truck to get to your home, if you’re the one waiting for a police car to get to your home, those two to three minutes could mean the difference between life or death,” Sharp said. “That’s what this pro
…(Omitted)…
After this section, please refer to the appendix for a direct summary in the form of a paper. It contains details on preprocessing and processing methods for code, mathematics, and more.