[Data Preprocessing and Pipeline: Key Index Notes]
Contents
1. Introduction
The introduction emphasizes the importance of the large-scale datasets needed to train large language models (LLMs). Following recent research, it points out that dataset size must be scaled together with model size, not model size alone.
Progress in natural language processing has come to depend increasingly on compute scale, and for large language models in particular it has become important to scale model size and dataset size together (Hoffmann et al., 2022). Earlier work focused mainly on model size, but recent research argues that increasing both the quality and the quantity of data is essential. This raises the concern that data acquisition may become a bottleneck due to quality and licensing issues (Villalobos et al., 2022). For example, optimally training a model the size of GPT-3 would require at least 3,500 billion tokens, far more than what currently available public datasets provide.
Advances in LLMs are not driven by compute scale alone; they are further improved by using large, high-quality datasets. Such data is usually obtained through web crawling, but its quality varies and it often contains low-quality content. Securing a high-quality training dataset is therefore essential, as it directly affects the performance of the trained model.
2. Related Work
The related work discusses the need for large-scale datasets and the effect of data quality on LLM performance. Early LLMs were trained on simple sentence-level data, but the field gradually moved to document-centric data, enabling models to handle longer context. Such datasets, however, often skip curation and can be of low quality, which can hurt model training.
A variety of data sources have been used to train large language models. Early models recognized the importance of datasets containing long documents (Radford et al., 2018), and web-crawled data later became the mainstream. However, web data often contains a large amount of low-quality content, limiting its usefulness for training effective models. This created a demand for high-quality data, and many studies combined web data with curated high-quality sources.
3. Method
This paper introduces MDR (MacroData Refinement), a new pipeline for training language models on large-scale web data. The pipeline improves data quality through strict deduplication and filtering, and it follows the principles below.
Design principles
The proposed method covers the cleaning and deduplication of web data, aiming at efficient data use and high data quality.
Reading the data
Data scraped from the web is provided in the WARC format and goes through several steps to extract only the main content.
URL filtering
The first filter screens out inappropriate or fraudulent sites. It is based on a score computed for each URL and checks for the presence of specific keywords.
Text extraction
Only the main content of each page is extracted using the trafilatura library. The extracted text is then further formatted with regular expressions.
Language identification
A fastText language classifier identifies the primary language of each document. Pages that do not contain natural language are removed at this stage.
3.1. Data Cleaning (Filtering)
Repetition removal
Documents containing duplicated or repetitive content can degrade a model's ability to generalize, so they are removed.
Document-wise filtering
Quality filters are applied to screen out low-quality content such as spam and sequences of special characters.
Line-wise corrections
Filters are applied to remove the unneeded parts of web pages.
Data cleaning removes unnecessary or erroneous data and can be expressed with the following formula.
\[D_{\text{filtered}} = \{\, d \in D \mid f(d) > \theta \,\}\]
Here \(D\) is the original data collection, \(d\) an individual document, \(f(d)\) a function that scores document quality, and \(\theta\) the quality threshold. The formula defines mathematically the quality criterion used to filter the data.
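As an illustration, here is a minimal Python sketch of this thresholded filtering; the `toy_quality` function is a placeholder standing in for the quality heuristics described in this section.

```python
from typing import Callable, Iterable, List

def filter_documents(docs: Iterable[str],
                     quality: Callable[[str], float],
                     theta: float) -> List[str]:
    """Keep only documents d with quality score f(d) above the threshold theta."""
    return [d for d in docs if quality(d) > theta]

def toy_quality(doc: str) -> float:
    """Toy stand-in for f(d): fraction of alphabetic/whitespace characters."""
    if not doc:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in doc) / len(doc)

print(filter_documents(["Hello world!", "$$$ ### @@@"], toy_quality, theta=0.8))
# ['Hello world!']
```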
3.2. Deduplication
Fuzzy deduplication
Near-duplicate documents are removed with the MinHash algorithm, which estimates the approximate similarity between documents and removes pairs with high overlap.
Exact deduplication
Exact matches of specific character sequences are found and removed. Because only part of a document is altered, the document as a whole is not deleted.
URL deduplication
Duplicate URLs are identified and removed, which is essential for preserving the diversity of the dataset.
Deduplication identifies and removes duplicate items in the dataset. The process can be described with the following mathematical formulation.
\[D_{\text{unique}} = \bigcup_{i=1}^{n} \{\, d_i \in D \mid \nexists\, j < i,\ d_j \equiv d_i \,\}\]
Here \(\equiv\) denotes that two documents are equivalent; removing duplicates from the dataset helps the model learn from more diverse data.
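For illustration, a minimal sketch of the keep-first-occurrence rule expressed by this formula, using a hash of whitespace-normalized text as a simple stand-in for the equivalence relation \(\equiv\) (the actual pipeline uses MinHash and exact substring matching).

```python
import hashlib
from typing import Iterable, List

def dedup_keep_first(docs: Iterable[str]) -> List[str]:
    """Keep d_i only if no earlier d_j is equivalent (identical after
    whitespace normalization, compared via SHA-1 digest)."""
    seen = set()
    unique = []
    for doc in docs:
        key = hashlib.sha1(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

print(dedup_keep_first(["a b", "a  b", "c"]))  # ['a b', 'c']
```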
4. Experiments
Experiments are set up with several benchmarks and state-of-the-art models to evaluate how RefinedWeb performs compared with existing high-quality datasets.
Evaluating web data
Models of several sizes trained on web data alone are compared with models trained on high-quality curated datasets. The results show that RefinedWeb matches the performance of models trained on high-quality sources.
Applying MDR to other datasets
The filtering and deduplication stages of the MDR pipeline are applied to other datasets to evaluate their effect. Strict deduplication consistently improves performance, while the filtering heuristics need to be tuned per data source.
5. Conclusion
This study shows that web data alone is sufficient to train high-quality language models, emphasizes that large-scale filtering and deduplication have a decisive effect on language model performance, and proposes a method for producing the large, high-quality datasets needed to train LLMs.
It notes that cleaning and deduplicating web data improves data quality and, in turn, enables training better models.
[Reference 1] A related dataset: Common Crawl
Common Crawl is a nonprofit organization, based in San Francisco and Los Angeles, California, that crawls websites and makes the collected large-scale data publicly available. It was founded in 2007 by Gil Elbaz, and its advisors include Peter Norvig, Rich Skrenta, and Eva Ho.
By improving access to open data and providing a resource that can be widely used for educational and research purposes, Common Crawl makes an important contribution to data scientists and researchers worldwide.
Key features and activities
Compliance: crawling respects nofollow and robots.txt directives, and copyrighted works contained in the collected data are distributed in the United States under fair-use claims.
History: data storage was improved by switching from the .arc file format to the .warc file format.
License
Intended Use
Relevant Factors and Evaluation
Evaluation Data
Dataset Purpose and Creation
Dataset Composition
Data Processing
Data Collection
Data Preprocessing and Labeling
Data Use and Distribution
Dataset Updates and Extension
Model Developer and Details
Intended Use
Factors
Evaluation Factors
Metrics
Evaluation Data
Preprocessing: Default prompts and setups from Gao et al. (2021) are used.
Data Refining Standard
Check Question? | Description |
---|---|
For what purpose was the dataset created? | RefinedWeb was created to serve as a large-scale dataset for the pretraining of large language models. It may be used on its own or augmented with curated sources. |
Who created the dataset and on behalf of which entity? | The dataset was created by the Technology Innovation Institute. |
Who funded the creation of the dataset? | The creation of the dataset was privately funded by the Technology Innovation Institute. |
Any other comments? | RefinedWeb is built on top of Common Crawl, using the Macrodata Refinement Pipeline. It adheres to specific principles, including scale, strict deduplication, and neutral filtering. |
What do the instances that comprise the dataset represent? | Instances are text-only documents corresponding to single web pages. |
How many instances are there in total? | RefinedWeb contains approximately 10 billion documents, or around 5 trillion tokens. The public version is a subset representing a tenth of the full version. |
Does the dataset contain all possible instances or is it a sample? | RefinedWeb is built using all Common Crawl dumps up to the 2023-06 one; it could be updated with additional dumps. The public release is a 600GT random extract from the 5,000GT full dataset. |
What data does each instance consist of? | Each instance is a text-only document with metadata about its origin in Common Crawl and source page URL. A multimodal version also contains interlaced links to images. |
Is there a label or target associated with each instance? | No. |
Is any information missing from individual instances? | No. |
Are relationships between individual instances made explicit? | No. |
Are there recommended data splits? | No. |
Are there any errors, sources of noise, or redundancies in the dataset? | Despite best efforts to filter content and deduplicate documents, some errors or redundancies may be present. |
Is the dataset self-contained or does it rely on external resources? | The base version of the dataset is self-contained, but the multimodal version includes links to external images. |
Does the dataset contain data that might be considered confidential? | All documents in RefinedWeb have been publicly available online. |
Does the dataset contain data that might be offensive or cause anxiety if viewed directly? | Yes, the dataset may contain such content, similar to The Pile. |
How was the data associated with each instance acquired? | The data was acquired by downloading WARC and WET files from the Common Crawl foundation. |
What mechanisms or procedures were used to collect the data? | Common Crawl’s data collection procedures were followed; details can be found on their website. |
If the dataset is a sample, what was the sampling strategy? | When using subsets, random sampling from the original data was employed. |
Who was involved in the data collection process and how were they compensated? | Common Crawl performed the original data collection; authors from this paper were involved in retrieving and preparing it. |
Over what time frame was the data collected? | Data was collected from all Common Crawl dumps from 2008 to January/February 2023. |
Were any ethical review processes conducted? | No. |
Was any preprocessing, cleaning, or labeling of the data done? | Yes, extensive preprocessing and cleaning were done, including content filtering, language identification, and deduplication. |
Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data? | Raw data was saved during development for some intermediary outputs, but not for the final production version due to storage constraints. |
Is the software used for preprocessing/cleaning/labeling the data available? | No. |
Has the dataset been used for any tasks already? | Yes, it has been used for developing large language models and scientific experiments. |
Is there a repository that links to any or all papers or systems that use the dataset? | No. |
What other tasks could the dataset be used for? | RefinedWeb may be suitable for various downstream uses due to its large-scale and representative nature. |
Is there anything about the composition or collection process that might impact future uses? | The public extract draws only from the English version, limiting multilingual applications. |
Are there tasks for which the dataset should not be used? | It should not be used for irresponsible or harmful tasks. |
Will the dataset be distributed to third parties outside of the entity on behalf of which it was created? | A 600GT extract will be made publicly available for NLP practitioners, but not the full version. |
How will the dataset be distributed? | The dataset will be available through the Hugging Face Hub. |
When will the dataset be distributed? | The dataset is available immediately. |
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? | The public extract is available under an ODC-By 1.0 license; Common Crawl’s ToU should also be followed. |
Have any third parties imposed IP-based or other restrictions on the data associated with the instances? | Not to the knowledge of the creators. |
Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? | Not to the knowledge of the creators. |
Will the dataset be updated? | No. |
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? | This information is not provided in the document. |
Data Analysis
Progress in natural language processing is increasingly driven by sheer compute scale alone (Sevilla et al., 2022): as more compute is expended to train large language models (LLM), they gain and exhibit powerful emergent capabilities (Brown et al., 2020; Wei et al., 2022). To best benefit from scaling, recent scaling laws dictate that both model size and dataset size should jointly be increased (Hoffmann et al., 2022). This is at variance with earlier findings, which had argued that scaling should focus on model size first and foremost, with minimal data scaling (Kaplan et al., 2020). This joint scaling paradigm raises significant challenges: although plentiful, text data is not infinite, especially so when considerations on data quality and licensing are taken into account – leading some researchers to argue scaling may soon be bottlenecked by data availability (Villalobos et al., 2022). Concretely, optimally training a GPT-3 sized model (175B parameters) would require no less than 3,500 billion tokens of text according to Hoffmann et al. (2022). This is twice as much as the largest pretraining datasets ever demonstrated (Hoffmann et al., 2022; Touvron et al., 2023), and ten times more than the largest publicly available English datasets such as OSCAR (Ortiz Suárez et al., 2019), C4 (Raffel et al., 2020), or The Pile (Gao et al., 2020).
Massively scaling-up pretraining data is made even more challenging by the fact LLMs are commonly trained using a mixture of web crawls and so-called “high-quality” data (Brown et al., 2020; Gao et al., 2020). Typical high-quality corpora include curated sources of books, technical documents, human-selected webpages, or social media conversations. The increased diversity and quality brought forth by these curated corpora is believed to be a key component of performant models (Scao et al., 2022b). Unfortunately, curation is labor-intensive: typically, each source requires specialized processing, while yielding a limited amount of data. Furthermore, licensed sources raise legal challenges. Nevertheless, most pretraining data is still sourced from massive web crawls which can be scaled up to trillions of tokens with limited human intervention. However, the quality of this data has traditionally been seen as (much) inferior to that of the manually curated data sources. Even finely processed sources of web data, such as C4 (Raffel et al., 2020) or OSCAR (Ortiz Suárez et al., 2019), are regarded as inferior to curated corpora for LLMs (Rae et al., 2021; Scao et al., 2022b), producing less performant models.
To sustain the ever-increasing data needs of larger and larger LLMs, and to streamline data pipelines and reduce the need for human-intensive curation, we propose to explore how web data can be better processed to significantly improve its quality, resulting in models as capable, if not more capable, than models trained on curated corpora.
Pretraining data for large language models
Early large language models identified the importance of datasets with long, coherent documents (Radford et al., 2018; Devlin et al., 2019). Moving on from the previously used sentence-wise datasets (Chelba et al., 2013), they instead leveraged document-focused, single-domain corpora like Wikipedia or Book Corpus (Zhu et al., 2015). As models increased in scale, datasets based on massive web scrapes gained prevalence (Ortiz Suárez et al., 2019; Raffel et al., 2020). However, further work argued that these untargeted web scrapes fell short of human-curated data (Radford et al., 2019), leading to the wide adoption of curated datasets such as The Pile (Gao et al., 2020), which combine web data with books, technical articles, and social media conversations. At scale, it has been proposed to emulate the human curation process by leveraging weak signals: for instance, by crawling the top links of a forum (Gokaslan et al., 2019). Targeted corpora can also produce domain-specific models (Beltagy et al., 2019), or broaden the expressiveness of models (e.g., for conversational modalities Adiwardana et al. (2020); Thoppilan et al. (2022)).
Latest large language models (Brown et al., 2020; Rae et al., 2021; Chowdhery et al., 2022; Scao et al., 2022a) are trained on giant aggregated corpora, combining both massive web scrape and so-called “high-quality” curated single-domain sources (e.g., news, books, technical papers, social media conversations). These targeted sources are often upsampled – from one to five times is most common – to increase their representation in the final dataset. The diversity and “higher quality” brought forth by these aggregated datasets is thought to be central to model quality; web data alone is considered insufficient to train powerful large language models (Liu et al., 2019; Scao et al., 2022b).
Pipelines for web data
Massive web datasets are typically built upon Common Crawl, a publicly available scrape of the internet, which has now been running for 12 years and has collected petabytes of data. Working with data scraped from all over the internet presents unique challenges: notably, a significant portion is low-quality machine-generated spam or pornographic content (Trinh & Le, 2018; Kreutzer et al., 2022). Accordingly, training on unfiltered web data is undesirable, resulting in poorly performing models (Raffel et al., 2020). Modern pipelines focus on filtering out this undesirable content (Wenzek et al., 2020). Broadly speaking, these pipelines usually combine a variety of stages: (1) language identification, leveraging inexpensive n-gram models (e.g., fastText Joulin et al. (2016)); (2) filtering rules and heuristics, such as only keeping lines with valid punctuation; and (3) deduplication, removing documents and spans that are repeated across the corpus.
Macrodata Refinement and Refined Web
We introduce MDR (MacroData Refinement), a pipeline for filtering and deduplicating web data from Common Crawl at very large scale. Using MDR, we produce REFINEDWEB, an English pretraining dataset of five trillion tokens based on web data only. We leverage strict filtering and stringent deduplication to uplift the quality of web data, distilling it down to a corpus matching the quality of aggregated corpora used to train state-of-the-art models.
We abide by the following guidelines:
Reading the data
Common Crawl is available in either WARC (raw HTML response) or WET files (preprocessed to only include plaintext). Individual files correspond to a page at a given URL; these constitute single documents/samples. Working with WET files would spare us from running our own HTML extraction; however, in line with previous works (Gao et al., 2020; Rae et al., 2021), we found WET files to include undesirable navigation menus, ads, and other irrelevant texts. Accordingly, our pipeline starts from raw WARC files, read with the warcio library.
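As an illustration (not the paper's actual code), a minimal sketch of reading response records from a WARC file with the warcio library; the file path is a placeholder for a real Common Crawl segment.

```python
from warcio.archiveiterator import ArchiveIterator

def iter_html_pages(warc_path: str):
    """Yield (url, raw_html_bytes) for every HTTP response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request and metadata records
            url = record.rec_headers.get_header("WARC-Target-URI")
            yield url, record.content_stream().read()

for url, html in iter_html_pages("CC-MAIN-example.warc.gz"):
    print(url, len(html))
```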
URL filtering
Before undertaking any compute-heavy processing, we perform a first filtering based on the URL alone. This targets fraudulent and/or adult websites (e.g., predominantly pornographic, violent, related to gambling, etc.). We base our filtering on two rules: (1) an aggregated blocklist of 4.6M domains; (2) a URL score, based on the presence of words from a list we curated and weighed by severity. We found that commonly used blocklists include many false positives, such as popular blogging platforms or even pop culture websites. Furthermore, word-based rules (like the one used in C4, Raffel et al. (2020)) can easily result in medical and legal pages being blocked. Our final detailed rules based on this investigation are shared in Appendix G.1. Since we intend Refined Web to be used as part of an aggregated dataset along with curated corpora, we also filtered common sources of high-quality data: Wikipedia, arXiv, etc. The detailed list is available in Appendix G.1.
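A minimal sketch of the two URL rules described above; the blocklist entries, keywords, weights, and threshold are illustrative placeholders, not the curated lists of Appendix G.1.

```python
from urllib.parse import urlparse

BLOCKLIST = {"example-spam.com", "bad-casino.net"}          # placeholder domains
WORD_WEIGHTS = {"casino": 3.0, "xxx": 5.0, "viagra": 4.0}   # placeholder weights
SCORE_THRESHOLD = 4.0                                       # placeholder threshold

def keep_url(url: str) -> bool:
    """Rule 1: reject blocklisted domains. Rule 2: reject URLs whose
    severity-weighted keyword score exceeds the threshold."""
    parsed = urlparse(url)
    if parsed.netloc.lower().removeprefix("www.") in BLOCKLIST:
        return False
    text = (parsed.netloc + parsed.path).lower()
    score = sum(w for word, w in WORD_WEIGHTS.items() if word in text)
    return score < SCORE_THRESHOLD

print(keep_url("https://www.example.com/article"))      # True
print(keep_url("https://xxx-casino.example.org/play"))  # False (score 8.0)
```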
Text extraction
We want to extract only the main content of the page, ignoring menus, headers, footers, and ads among others. Lopukhin (2019) found that trafilatura (Barbaresi, 2021) was the best non-commercial library for retrieving content from blog posts and news articles. Although this is only a narrow subset of the kind of pages making up Common Crawl, we found this finding to hold more broadly. We use trafilatura for text extraction and apply extra formatting via regular expressions: we limit newlines to two consecutive ones and remove all URLs.
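A minimal sketch of this extraction step, assuming the trafilatura Python API and the formatting rules described above (collapse newlines to at most two, strip URLs).

```python
import re
from typing import Optional

import trafilatura

def extract_text(html: str) -> Optional[str]:
    """Extract main content with trafilatura, then apply light formatting."""
    text = trafilatura.extract(html)           # returns None if nothing usable is found
    if text is None:
        return None
    text = re.sub(r"https?://\S+", "", text)   # remove all URLs
    text = re.sub(r"\n{3,}", "\n\n", text)     # limit newlines to two consecutive ones
    return text.strip()

# extract_text(raw_html) would be applied to each WARC response body.
```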
Language identification
We use the fastText language classifier of CCNet (Wenzek et al., 2020) at the document level: it uses character n-grams and was trained on Wikipedia, supporting 176 languages. We remove documents for which the top language scores below 0.65; this usually corresponds to pages without any natural text. For this paper, we focus on English; RefinedWeb can also be derived for other languages, see Appendix D for details. The data we retrieve at this stage, called RW-RAW, corresponds to what we can extract with the minimal amount of filtering. At this stage, only 48% of the original documents are left, mostly filtered out by language identification.
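A minimal sketch of document-level language identification with fastText; lid.176.bin is fastText's publicly released language-ID model, used here as a stand-in for the CCNet classifier, and the 0.65 threshold follows the text above.

```python
import fasttext

model = fasttext.load_model("lid.176.bin")  # public fastText language-ID model

def is_english(document: str, threshold: float = 0.65) -> bool:
    """Keep a document only if its top predicted language is English with score >= threshold."""
    labels, scores = model.predict(document.replace("\n", " "), k=1)  # fastText expects one line
    return labels[0] == "__label__en" and scores[0] >= threshold

print(is_english("The quick brown fox jumps over the lazy dog."))
```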
Repetition removal
Due to crawling errors and low-quality sources, many documents contain repeated sequences; this may cause pathological behavior in the final model (Holtzman et al., 2019). We could catch this content at the later deduplication stage, but it is cheaper and easier to catch it document-wise early on. We implement the heuristics of Rae et al. (2021) and remove any document with excessive line, paragraph, or n-gram repetitions.
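Below is a minimal sketch of repetition checks in the spirit of these heuristics; the thresholds are illustrative placeholders, not the values used in MDR.

```python
from collections import Counter

def has_excessive_repetition(doc: str,
                             max_dup_line_frac: float = 0.3,
                             max_top_ngram_frac: float = 0.2,
                             n: int = 3) -> bool:
    """Flag documents with too many duplicated lines or a dominant repeated n-gram."""
    lines = [l.strip() for l in doc.splitlines() if l.strip()]
    if lines:
        dup_line_frac = 1.0 - len(set(lines)) / len(lines)
        if dup_line_frac > max_dup_line_frac:
            return True
    words = doc.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if ngrams:
        top_count = Counter(ngrams).most_common(1)[0][1]
        if top_count * n / len(words) > max_top_ngram_frac:
            return True
    return False

print(has_excessive_repetition("buy now " * 50))  # True
```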
Document-wise filtering
A significant fraction of pages are machine-generated spam, made predominantly of lists of keywords, boilerplate text, or sequences of special characters. Such documents are not suitable for language modeling; to filter them out, we adopt the quality filtering heuristics of Rae et al. (2021). These focus on removing outliers in terms of overall length, symbol-to-word ratio, and other criteria ensuring the document is actual natural language. We note that these filters have to be adapted on a per-language basis, as they may result in over-filtering if naively transferred from English to other languages.
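A minimal sketch of such document-wise filters is shown below; the thresholds and symbol list are illustrative placeholders, not the values used in MDR.

```python
def passes_quality_filters(doc: str,
                           min_words: int = 50,
                           max_words: int = 100_000,
                           max_symbol_ratio: float = 0.1) -> bool:
    """Drop outliers in overall length and symbol-to-word ratio (placeholder thresholds)."""
    words = doc.split()
    if not (min_words <= len(words) <= max_words):
        return False
    symbol_hits = sum(doc.count(s) for s in ("#", "...", "…"))
    return symbol_hits / max(len(words), 1) <= max_symbol_ratio
```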
Line-wise corrections
Despite the improvements brought forth by using trafilatura instead of relying on preprocessed files, many documents remain interlaced with undesirable lines (e.g., social media counters, navigation buttons). Accordingly, we devised a line-correction filter, targeting these undesirable items. If these corrections remove more than 5% of a document, we remove it entirely. See Appendix G.2 for details. The data we retrieve at this stage has gone through all of the filtering heuristics in the MDR pipeline. We refer to this dataset as RW-FILTERED. Only 23% of the documents of Common Crawl are left, with around 50% of the documents of RW-Raw removed by the filtering.
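The sketch below illustrates the shape of such a line-correction pass; only the 5% budget is taken from the text above, and the specific patterns are illustrative placeholders rather than the rules of Appendix G.2.

```python
import re

LINE_PATTERNS = [
    re.compile(r"^\d+\s+(likes?|shares?|comments?)$", re.IGNORECASE),
    re.compile(r"^(home|previous|next|share|sign in|log in)$", re.IGNORECASE),
]

def line_corrections(doc: str, max_removed_frac: float = 0.05):
    """Strip undesirable lines; drop the whole document if more than 5% of it is removed."""
    kept = [line for line in doc.splitlines()
            if not any(p.match(line.strip()) for p in LINE_PATTERNS)]
    cleaned = "\n".join(kept)
    if (len(doc) - len(cleaned)) / max(len(doc), 1) > max_removed_frac:
        return None   # document removed entirely
    return cleaned
```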
After filtering, although data quality has improved, a large fraction of the content is repeated across documents. This may be due to the crawler indirectly hitting the same page multiple times, boilerplate content being repeated (e.g., licenses), or even plagiarism. These duplicates can strongly impact models, favoring memorization instead of generalization (Lee et al., 2022; Hernandez et al., 2022). Since deduplication is expensive, it has seen limited adoption in public datasets (Ortiz Su´arez et al., 2019; Raffel et al., 2020). We adopt an aggressive deduplication strategy, combining both fuzzy document matches and exact sequences removal.
Fuzzy Deduplication
We remove similar documents by applying MinHash (Broder, 1997): for each document, we compute a sketch and measure its approximate similarity with other documents, eventually removing pairs with high overlap. MinHash excels at finding templated documents: licenses with only specific entities differing, placeholder SEO text repeated across websites—see examples of the biggest clusters in Appendix H.1. We perform MinHash deduplication using 9,000 hashes per document, calculated over 5-grams and divided into 20 buckets of 450 hashes. Using less aggressive settings, such as the 10 hashes of The Pile (Gao et al., 2020), resulted in lower deduplication rates and worsened model performance. See Appendix G.3.1 for more details about our MinHash setup.
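For illustration, a minimal sketch of MinHash-based fuzzy deduplication over word 5-grams, using the datasketch library as a stand-in for the paper's own implementation; the bucket layout mirrors the description above, while the key scheme is ours.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 9000

def minhash_of(doc: str) -> MinHash:
    """Build a MinHash sketch over word 5-gram shingles of a document."""
    words = doc.lower().split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(words) - 4, 0)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

lsh = MinHashLSH(num_perm=NUM_PERM, params=(20, 450))  # 20 buckets of 450 hashes

def deduplicate(docs):
    """Keep a document only if no highly similar document was kept before it."""
    kept = []
    for idx, doc in enumerate(docs):
        m = minhash_of(doc)
        if lsh.query(m):      # an approximate duplicate is already in the index
            continue
        lsh.insert(f"doc-{idx}", m)
        kept.append(doc)
    return kept
```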
Exact Deduplication
Exact substring operates at the sequence-level instead of the document-level, finding matches between strings that are exact token-by-token matches by using a suffix array (Manber & Myers, 1993) (e.g., specific disclaimers or notices, which may not compromise the entire document as showcased in Appendix H.2). We remove any match of more than 50 consecutive tokens, using the implementation of Lee et al. (2022). We note that exact substring alters documents, by removing specific spans: we also experimented with dropping entire documents or loss-masking the duplicated strings instead of cutting them, but this didn’t result in significant changes in zero-shot performance—see Appendix G.3.2.
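For intuition only, here is a much-simplified sketch of the same idea: instead of a suffix array, repeated 50-token windows are detected with a hash table, which only approximates (far less efficiently) the exact-substring matching of Lee et al. (2022).

```python
from collections import Counter

SPAN = 50  # minimum length of an exact match, in tokens

def window_hashes(tokens):
    """Hashes of all SPAN-token windows in a tokenized document."""
    return [hash(tuple(tokens[i:i + SPAN])) for i in range(len(tokens) - SPAN + 1)]

def remove_repeated_spans(corpus_tokens):
    """corpus_tokens: list of token lists, one per document.
    Remove tokens covered by any SPAN-token window seen more than once."""
    counts = Counter(h for doc in corpus_tokens for h in window_hashes(doc))
    cleaned = []
    for doc in corpus_tokens:
        drop = [False] * len(doc)
        for i, h in enumerate(window_hashes(doc)):
            if counts[h] > 1:               # this exact span occurs elsewhere too
                for j in range(i, i + SPAN):
                    drop[j] = True
        cleaned.append([t for t, d in zip(doc, drop) if not d])
    return cleaned
```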
URL Deduplication
Because of computational constraints, it is impossible for us to perform deduplication directly on RW-Filtered. Instead, we split CommonCrawl into 100 parts, where each part contains a hundredth of each dump, and perform deduplication on individual parts. Most of the larger duplicate clusters (e.g., licences, common spams) will be shared across parts, and effectively removed. However, we found that CommonCrawl dumps had significant overlap, with URLs being revisited across dumps despite no change in content. Accordingly, we keep a list of the URLs of all samples we have kept from each part, and remove them from subsequent parts being processed.
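A minimal sketch of this URL-level deduplication across parts, assuming a hypothetical `load_part` loader that yields (url, document) pairs for one part.

```python
def deduplicate_by_url(parts, load_part):
    """Process parts in order; remember kept URLs and skip them in later parts."""
    seen_urls = set()
    for part in parts:
        for url, document in load_part(part):
            if url in seen_urls:
                continue  # page already kept from an earlier part
            seen_urls.add(url)
            yield url, document
```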
We now validate that RefinedWeb can be used to train powerful models, matching the zero-shot performance obtained with curated corpora and state-of-the-art language models.
Evaluation: Unlike previous works, we focus our evaluation on zero-shot generalization across many tasks rather than measuring validation loss. Our evaluation setup is inspired by the one used by the architecture and scaling group of Big Science (Scao et al., 2022b). We base our evaluation on the Eleuther AI evaluation harness (Gao et al., 2021), allowing us to evaluate across a wide range of tasks in the zero-shot setting.
We perform a small-scale study with 1B and 3B parameter models trained to optimality on popular web and curated datasets. Then, we scale up to 1B and 7B models trained on 350GT, and compare zero-shot generalization to state-of-the-art models.
Small-Scale Study
We first consider popular public web datasets such as OSCAR-21.09 and OSCAR-22.01 (Ortiz Suárez et al., 2019; Abadji et al., 2021), and C4 (Raffel et al., 2020), along with The Pile (Gao et al., 2020) as the most popular publicly available curated dataset, and variations of RefinedWeb (RW-Raw, RW-Filtered, and RW as described in Section 3). All models are trained with the same architecture and internal codebase; they are evaluated within the same framework—only pretraining datasets differ.
Results averaged on the small-scale aggregate of 6 tasks are presented in Table 4. We observe relatively strong performance of all web datasets compared to The Pile, showcasing that curation is not a silver bullet for performant language models. C4 is identified as a strong pretraining dataset, aligning with the findings of Scao et al. (2022b); The Pile, however, underperforms the web datasets on our benchmarks. Disappointing results on OSCAR-22.01 may be due to the main version of the dataset being distributed without deduplication. For RefinedWeb, both filtering and deduplication significantly improve performance.
Full-Scale Models
We validate these results against state-of-the-art models by scaling up our previous experiments, training 1B and 7B models on 350GT. Additionally, we train a 1B model on 350GT on The Pile as a control for the influence of our pretraining setup. We compare with several models:
For GPT-3, we distinguish between results obtained through the API (babbage and curie) with the EleutherAI LM evaluation harness (Gao et al., 2021), and results reported in their paper, which used a different evaluation setup (†). For PaLM and OPT, results were also obtained with a different evaluation suite (†), while for other models they were obtained with the evaluation harness as well, allowing for more direct comparisons.
Results on main-agg are presented in Figure 1, and in Figure 3 for core-agg and ext-agg. We find that open models consistently underperform models trained on private curated corpora, such as GPT-3—even when using a similar evaluation setup. Conversely, models trained on RefinedWeb are able to match the performance of the GPT-3 series using web data alone, even though common high-quality sources used in The Pile are excluded from RefinedWeb (see Table 14 in Appendix).
Challenging existing beliefs on data quality and LLMs, models trained on adequately filtered and deduplicated web data alone can match the performance of models trained on curated data.
Ablating the contributions and evaluating the performance of individual components in the MDR pipeline is difficult: for most heuristics, there is no agreed-upon ground truth, and changes may be too insignificant to result in sufficient zero-shot signal after pretraining. In the first half of Section 4.2, we identified that subsequent stages of RefinedWeb (raw, filtered, final) led to improvements in performance. In this section, we propose to apply independently the filtering and deduplication stages of MDR to popular pretraining datasets, studying whether they generalize widely.
Results on the small-agg are reported in Table 5. First, we find that improvements from filtering are not systematic. On The Pile, we had to adjust our line length and characters ratio heuristics to avoid expunging books and code. Despite improvements on OSCAR-21.09, C4, and The Pile, our filters worsen performance on OSCAR-22.01; generally, removal rates from filtering do not seem strongly correlated with downstream accuracy. Conversely, deduplication delivers a steady boost across all datasets, and removal rates are better correlated with changes in performance. We find OSCAR-21.09 and C4 to be already well deduplicated, while The Pile and OSCAR-22.01 exhibit 40-60% duplicates. The base version of OSCAR-22.01 is distributed without deduplication; for The Pile, this is consistent with the findings of Zhang et al. (2022). Finally, combining filtering and deduplication results in further improvements; interestingly, although performance is now more uniform across datasets, differences remain, suggesting that flaws in the original text extraction and processing can’t be fully compensated for.
By processing C4 through MDR, we are able to obtain subsets of data which might slightly outperform RefinedWeb; this combines both the stringent filtering of C4 (e.g., strict NSFW word blocklist, 3-sentence span deduplication) with our own filters and deduplication. While such a combination results in rejection rates that would be unacceptable for our target of 3-6 trillion tokens, this represents an interesting perspective for shorter runs, which may be able to extract extremely high-quality subsets from large web datasets.
While filtering heuristics may require source-dependent tuning, stringent deduplication improves zero-shot performance across datasets consistently.
As LLMs are widely adopted, models trained past the recommendations of scaling laws are bound to become increasingly common to amortize inference costs (Touvron et al., 2023). This will further drive the need for pretraining datasets with trillions of tokens, an order of magnitude beyond publicly available corpora. We have demonstrated that stringent filtering and deduplication could result in a five-trillion-token, web-only dataset suitable for producing models competitive with the state of the art, even outperforming LLMs trained on curated corpora. We publicly release a 600GT extract of RefinedWeb, and note that RefinedWeb has already been used to train state-of-the-art language models, such as Falcon-40B (Almazrouei et al., 2023).
Check Question? | Description |
---|---|
For what purpose was the dataset created? | RefinedWeb was created to serve as a large-scale dataset for the pretraining of large language models. It may be used on its own or augmented with curated sources (e.g., Wikipedia, StackOverflow). |
Who created the dataset and on behalf of which entity? | The dataset was created by the Technology Innovation Institute. |
Who funded the creation of the dataset? | The creation of the dataset was privately funded by the Technology Innovation Institute. |
Any other comment? | RefinedWeb is built on top of Common Crawl, using the Macrodata Refinement Pipeline, which combines content extraction, filtering heuristics, and deduplication. In designing RefinedWeb, we abided by the following philosophy: (1) Scale first: We intend MDR to produce datasets to be used to train 40-200 billion parameters models, thus requiring trillions of tokens (Hoffmann et al., 2022). For English-only RefinedWeb, we target a size of 3-6 trillion tokens. Specifically, we eschew any labor-intensive human curation process and focus on Common Crawl instead of disparate single-domain sources. (2) Strict deduplication: Inspired by the work of Lee et al. (2022), which demonstrated the value of deduplication for large language models, we implement a rigorous deduplication pipeline. We combine both exact and fuzzy deduplication and use strict settings leading to removal rates far higher than others have reported. (3) Neutral filtering: To avoid introducing further undesirable biases into the model (Dodge et al., 2021; Welbl et al., 2021), we avoid using ML-based filtering outside of language identification. We stick to simple rules and heuristics and use only URL filtering for adult content. |
Check Question? | Description |
---|---|
What do the instances that comprise the dataset represent? | Instances are text-only documents, corresponding to single webpages. |
How many instances are there in total? | RefinedWeb contains approximately 10 billion documents, or around 5 trillion tokens. The public version is a subset representing a tenth of the full version. |
Does the dataset contain all possible instances, or is it a sample (not necessarily random) of instances from a larger set? | RefinedWeb is built using all Common Crawl dumps until the 2023-06 one; it could be updated with additional dumps as they are released. The public release of RefinedWeb is a 600GT random extract of the 5,000GT of the full dataset. For all experiments, we randomly sampled from the public extract or earlier development versions of it. |
What data does each instance consist of? | Each instance is a text-only document, with metadata about its origin in Common Crawl and source page URL. We also distribute a multimodal version of RefinedWeb, containing interlaced links to images. |
Is there a label or target associated with each instance? | No. |
Is any information missing from individual instances? | No. |
Are relationships between individual instances made explicit? | No. |
Are there recommended data splits? | No. |
Are there any errors, sources of noise, or redundancies in the dataset? | Despite our best efforts to filter content that does not qualify as natural language and to deduplicate documents, our pipeline may let through documents that may be considered as errors or redundant. |
Is the dataset self-contained, or does it link to or otherwise rely on external resources? | The base version of the dataset is self-contained, but the multimodal version is interlaced with links to images – these are not distributed as part of the dataset and constitute an external source. |
Does the dataset contain data that might be considered confidential? | All documents in RefinedWeb have been publicly available online. |
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? | Yes, as this type of data is prevalent on the internet, it is likely our dataset contains such content. Notably, we estimate the prevalence of toxic content in the dataset to be similar to The Pile (Figure 4). |
Check Question? | Description |
---|---|
How was the data associated with each instance acquired? | We downloaded with warcio the publicly available WARC and WET files from the Common Crawl Foundation. |
What mechanisms or procedures were used to collect the data? | We refer to the Common Crawl website (commoncrawl.org) for details on how they collect data. |
If the dataset is a sample from a larger set, what was the sampling strategy? | Whenever we use subsets, we randomly sample from the original data. |
Who was involved in the data collection process and how were they compensated? | The original data collection was performed by Common Crawl; authors from this paper were involved in retrieving it and preparing it. |
Over what timeframe was the data collected? | We use all Common Crawl dumps from 2008 to January/February 2023. |
Were any ethical review processes conducted? | No. |
Check Question? | Description |
---|---|
Has the dataset been used for any tasks already? | Yes, this data has been used to develop large language models: both for scientific experiments (e.g., this paper) and production use. |
Is there a repository that links to any or all papers or systems that use the dataset? | No. |
What (other) tasks could the dataset be used for? | RefinedWeb was built as a large-scale corpora representative of the web, and as such, it may serve many downstream uses which are difficult to predict. |
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? | For the public extract of RefinedWeb, we chose to only draw from the English version of the dataset, preventing multilingual applications. |
Are there tasks for which the dataset should not be used? | Any tasks which may be considered irresponsible or harmful. |
Check Question? | Description |
---|---|
Will the dataset be distributed to third parties outside of the entity on behalf of which the dataset was created? | Yes, we make a 600GT extract publicly available for NLP practitioners. We currently don’t plan to share the full version of the dataset. |
How will the dataset be distributed? | The dataset will be made available through the Hugging Face Hub. |
When will the dataset be distributed? | The dataset is available immediately. |
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? | The public extract is made available under an ODC-By 1.0 license; users should also abide by the Common Crawl Terms of Use. |
Have any third parties imposed IP-based or other restrictions on the data associated with the instances? | Not to our knowledge. |
Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? | Not to our knowledge. |
Check Question? | Description |
---|---|
Who will be supporting/hosting/maintaining the dataset? | The dataset will be hosted on the Hugging Face Hub, and we have no plans to further support or maintain it once it is released. |
How can the owner/curator/manager of the dataset be contacted? | You can contact the owner/curator/manager at falconTextGenerationLLM@tii.ae. |
Is there an erratum? | No. |
Will the dataset be updated? | No. |
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? | This information is not provided in the document. |