
Model | Data Provenance

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-10-25

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

  • url: https://arxiv.org/abs/2310.16787
  • pdf: https://arxiv.org/pdf/2310.16787
  • abstract: The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, from their source, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data. This points to a deepening divide in the types of data that are made available under different license conditions, and heightened implications for jurisdictional legal interpretations of copyright and fair use. We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+. This points to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire audit, with an interactive UI, the Data Provenance Explorer, which allows practitioners to trace and filter on data provenance for the most popular open source finetuning data collections: this http URL.

[Key data-related index highlights]


Contents


  • Reducing risk through dataset management and license auditing
  • Providing detailed Data Provenance Cards and legal guidance
  • Developing tools and standards to strengthen transparency and accountability

1. Introduction

The progress of language models is driven largely by the diversity and richness of large-scale training datasets. The recent trend is to build new models by combining thousands of datasets and web sources, yet the effort to document and understand the origin and characteristics of this data is steadily declining. This situation can lead to problems such as data leakage, exposure of personal information, unintended behavior, or bias, and ultimately to lower-quality models. This work introduces the concept of data provenance, which covers a dataset's sourcing, creation, and licensing heritage, to address these problems.


2. The Data Provenance Audit

2.1 Data Provenance Explorer (DPExplorer)

As the first step of the initiative, the lineage of 44 widely used training data collections is traced under the guidance of legal and AI experts. Each dataset's source, license, and creator information is recorded in detail to help practitioners understand and manage dataset risk.

  • Expanding the metadata on dataset identifiers, characteristics, and provenance is essential for assessing the legal and ethical risks of the data.
  • A pipeline for tracing dataset provenance is designed in collaboration with legal experts.

2.2 License Annotation Process

Validating the licenses of datasets is a central part of this project. The license type of each dataset is verified, and its conditions are categorized so that machine learning practitioners can understand them easily.

  • The classification of license types directly affects whether training data can be used.
  • Use of a dataset may be restricted by the license conditions required by the original data source.

2.3 Data Provenance Card: A Data Bibliography

A Data Provenance Card clearly documents a dataset's origin and conditions of use, aiding the understanding and management of datasets. The card provides metadata including the dataset's source, license, languages, and task types.

  • A structured metadata store can make the legal conditions attached to a dataset's use explicit.
  • Data Provenance Cards help researchers easily trace and document data sources.


3. Empirical Analysis of Data Provenance

3.1 Analysis of Licenses in the Wild

License types and conditions of use are analyzed at scale to understand the distribution and characteristics of licenses as they are actually applied.

  • The distribution of license types is a key factor in determining a dataset's usability and legal risk.
  • Accurately understanding and classifying the many license types is essential for guaranteeing the safe use of datasets.

3.2 Data Availability by License Use Category

This analysis examines how dataset characteristics and availability differ with license restrictions. Datasets under non-commercial licenses tend to be more diverse than commercially usable ones.

  • Non-commercial licenses restrict how data may be used and can limit data accessibility in certain research or development areas.
  • Analyzing dataset characteristics allows the effect of licensing on data diversity to be evaluated quantitatively.


4. Legal Discussion

The complex interplay between data licenses and copyright is analyzed, and the legal usability of datasets is assessed. Legal guidelines on dataset creation and use are provided to help researchers and developers manage legal risk.

  • Clarifying the legal status of a dataset is important for minimizing the legal risks of using it.
  • Legal analysis makes it possible to understand precisely the conditions and restrictions on a dataset's use.


5. Related Work

Data documentation and analysis have long been emphasized as an important part of NLP research. Prior studies stress the importance of documenting data and present various approaches for increasing the transparency of data management and use. Building on this prior work, the paper develops concrete tools and standards for data licensing.


1 Introduction

The latest wave of language models, both public (Chung et al., 2022; Taori et al., 2023; Geng et al., 2023) and proprietary (Anil et al., 2023; OpenAI, 2023; Anthropic, 2023; Yoo et al., 2022) attribute their powerful abilities in large part to the diversity and richness of ever larger training datasets, including pre-training corpora, and finetuning datasets compiled by academics (Wei et al., 2021; Sanh et al., 2021; Muennighoff et al., 2022), synthetically generated by models (Taori et al., 2023; Wang et al., 2022a), or aggregated by platforms like Hugging Face (Lhoest et al., 2021). Recent trends see practitioners combining and re-packaging thousands of datasets and web sources (Gao et al., 2020; Penedo et al., 2023; Wang et al., 2022b; Longpre et al., 2023a), but despite some notable documentation efforts (Spacerini, 2021; Biderman et al., 2022), there are diminishing efforts to attribute, document or understand the raw ingredients into new models (Dodge et al., 2021; Bandy and Vincent, 2021; Bommasani et al., 2023a).

A Crisis in Data Transparency & its Consequences. Increasingly, widely used dataset collections are treated as monolithic, instead of a lineage of data sources, scraped (or model generated), curated, and annotated, often with multiple rounds of re-packaging (and re-licensing) by successive practitioners. The disincentives to acknowledge this lineage stem both from the scale of modern data collection (the effort to properly attribute it), and the increased copyright scrutiny (Saveri et al., 2023). Together, these factors have seen fewer Datasheets (Gebru et al., 2021), non-disclosure of training sources (OpenAI, 2023; Anil et al., 2023; Touvron et al., 2023), and ultimately a decline in understanding training data (Sambasivan et al., 2021b; Longpre et al., 2023b).

This lack of understanding can lead to data leakages between training and test data (Elangovan et al., 2021; Carlini et al., 2022), expose personally identifiable information (PII) (Bubeck et al., 2023), present unintended biases or behaviours (Welbl et al., 2021; Xu et al., 2021; Pozzobon et al., 2023), and generally result in lower quality models than anticipated. Beyond these practical challenges, information gaps and documentation debt incur substantial ethical and legal risks. For instance, model releases appear to contradict data terms of use (e.g., WizardCoder (Luo et al., 2023) licensed for commercial use, while training on commercially-prohibited OpenAI data), license revisions post-public release (with MPT-StoryTeller (Frankle, 2023)), and even copyright lawsuits (e.g. Stability AI (Arstechnica, 2023) and OpenAI (Saveri et al., 2023)). As training models on data is both expensive and largely irreversible, these risks and challenges are not easily remedied. In this work, we term the combination of these indicators, including datasets’ sourcing, creation and licensing heritage, as well as their characteristics, Data Provenance.

Unreliable Data Provenance & Licensing. Our work motivates the urgency of tooling that facilitates informed and responsible use of data in both pretraining and finetuning. To empower practitioners to attribute data provenance, we develop a set of tools and standards to trace the data lineage of 44 of the most widely used and adopted text data collections, spanning 1800+ finetuning datasets. We compile and expand relevant metadata with a much richer taxonomy than Hugging Face, Papers with Code, or other aggregators (see Section 2.1). With legal experts, we design a pipeline for tracing dataset provenance, including the original source of the dataset, the associated licenses, creators, and subsequent use.

As a byproduct of our work establishing the Data Provenance of widely used datasets, we are able to characterize the AI data ecosystem/supply chain (Cen et al., 2023; Bommasani et al., 2023c), as well as the state of the field for policymakers, researchers and legal experts. Our work points to a crisis in license laundering and informed usage of popular datasets, with systemic problems in sparse, ambiguous, or incorrect license documentation. Notably, we find that 70%+ of licenses for popular datasets on GitHub and Hugging Face are “Unspecified”, leaving a substantial information gap that is difficult to navigate in terms of legal responsibility. Second, the licenses that are attached to datasets uploaded to dataset sharing platforms are often inconsistent with the license ascribed by the original author of the dataset—our rigorous re-annotation of licenses finds that 66% of analyzed Hugging Face licenses were in a different use category, often labeled as more permissive than the author’s intended license. As a result, much of this data is risky to use (or harmfully misleading) for practitioners who want to respect the data provenance of a work. Our initiative reduces “Unspecified” licenses from 72%+ to 30% and attaches license URLs for under-resourced model developers to more confidently select appropriate data for their needs. To this end, the Data Provenance Initiative supports attribution and responsible AI with the following contributions:

  1. The most extensive known public audit of AI Data Provenance, tracing the lineage of 1800+ text datasets (the “DPCollection”), their licenses, conditions, and sources. We demonstrate a growing adoption and reliance on software licenses in the AI community and synthesize observations into legal guidance for developers (Section 4).
  2. The Data Provenance Explorer (DPExplorer)∗, an open-source repository for downloading, filtering, and exploring data provenance and characteristics. Our tools auto-generate Data Provenance Cards for scalable symbolic attribution and future documentation best practices.
  3. We find a sharp and widening divide between commercially open and closed data, with the latter monopolizing more diverse and creative sources. We suggest a data collection focus to narrow this gap.

2 The Initiative to Audit Data Provenance

The Data Provenance Initiative’s goal is to audit popular and widely used datasets with large-scale Legal and AI expert-guided annotation. We propose a base set of indicators necessary for tracing dataset lineage and understanding dataset risks (described in Section 2.1). As a first contribution of the initiative, we audit 44 instruction or “alignment” finetuning data collections composed of 1858 individual datasets, selected by experts for their widespread adoption and use in the community. The selected collections and their variants see 100s to 10M+ monthly downloads on Hugging Face, with the datasets within these collections tallying to many more (Table 1).

The initiative’s initial focus on alignment finetuning datasets was decided based on their growing emphasis in the community for improving helpfulness, reducing harmfulness, and orienting models to human values (Ouyang et al., 2022). Some collections have overlapping datasets and examples, but we choose not to deduplicate in order to preserve the original design choices, which may include different templates, formatting, and filtering. We remove datasets related to common benchmarks like MMLU (Hendrycks et al., 2020) and BigBench (Srivastava et al., 2023).

Figure 1: The DPCollection annotation pipeline uses human and human-assisted procedures to annotate dataset Identifiers, Characteristics, and Provenance. The Data Lifecycle is traced, from the original sources (web scrapes, human or synthetic text), to curated datasets and packaged collections. Information is collected at each stage, not just the last. The License Annotation Procedure is described in Section 2.2.

2.1 Data Provenance Explorer (DPExplorer)

Our information audit spans (I) identifier information, bridging metadata from several aggregators, including Hugging Face, GitHub, Papers with Code, Semantic Scholar, and ArXiv, (II) detailed dataset characteristics for a richer understanding of training set composition, and (III) dataset provenance for licensing and attribution. We expand our provenance metadata beyond just licenses, because conversations with practitioners revealed they rely not only on data licenses, but on a specific legal & ethical risk tolerance, parameterized by (a) the lineage of licenses, (b) the data source, (c) the creator’s identity, and (d) the precedence of adoption by other developers.
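As a rough illustration of how those four risk-tolerance parameters could drive dataset selection, here is a minimal sketch. The record fields and thresholds are hypothetical, not the DPExplorer schema:

```python
from dataclasses import dataclass

@dataclass
class Provenance:
    license_lineage: list    # (a) licenses inherited through re-packaging
    source: str              # (b) e.g. "human web text" or "model-generated"
    creator: str             # (c) identity of the dataset creator
    monthly_downloads: int   # (d) a proxy for precedence of adoption

def acceptable(p: Provenance, blocked_creators=frozenset(), min_downloads=0,
               allow_model_generated=True) -> bool:
    """One possible risk predicate; a real legal team would tune each criterion."""
    if any(lic == "Unspecified" for lic in p.license_lineage):
        return False                       # reject unclear license lineage
    if p.creator in blocked_creators:
        return False                       # e.g. exclude competitors' data
    if not allow_model_generated and p.source == "model-generated":
        return False
    return p.monthly_downloads >= min_downloads
```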

We release our extensive audit as two tools: (1) a data explorer interface, the Data Provenance Explorer (DPExplorer), for widespread use, and (2) an accompanying repository for practitioners to download the data filtered for license conditions. Practitioners are also able to generate a human-readable markdown summary, or Data Provenance Card, of the used datasets and their compositional properties for languages, tasks, and licenses (Section 2.3). Modern researchers training on hundreds of datasets often find it onerous to manually curate extensive data cards for these compilations (Mitchell et al., 2019; Gebru et al., 2021). We hope this tool will aid in writing the data attribution and composition sections of these documentation efforts, by providing auto-generated, copy-and-pastable dataframe summaries.
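A minimal sketch of that second workflow, with hypothetical record fields: filter records by permitted-use category, then render a copy-and-pastable markdown Data Provenance Card:

```python
def filter_by_use(records, allowed=frozenset({"commercial"})):
    """Keep only datasets whose permitted-use category is acceptable."""
    return [r for r in records if r["permitted_use"] in allowed]

def provenance_card(records):
    """Render a markdown summary, one attribution line per dataset."""
    lines = ["## Data Provenance Card", ""]
    for r in sorted(records, key=lambda r: r["name"]):
        conds = [c for c, on in [("attribution", r["attribution"]),
                                 ("share-alike", r["share_alike"])] if on]
        lines.append(f"- **{r['name']}** ({r['license']}; "
                     f"{', '.join(conds) or 'no conditions'}) - {r['url']}")
    return "\n".join(lines)

records = [{"name": "example-qa", "permitted_use": "commercial",
            "license": "CC-BY 4.0", "attribution": True, "share_alike": False,
            "url": "https://example.org"}]  # toy record, not real audit data
print(provenance_card(filter_by_use(records)))
```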

Table 1: Alignment tuning collections and their characteristics. Properties of the collections include the numbers of datasets, dialogs, unique tasks, languages, topics, text domains, Hugging Face monthly downloads (“Downs”), and the average length of input and target text, in characters. The Source column indicates whether a collection includes human web text. The dialog formats of each collection can be: zero-shot (Z), few-shot (F), chain-of-thought (C), response ranking (R), and multi-turn dialog (M). The Use column indicates whether a collection includes data licensed for commercial use, data with no license (“Unspecified”), or data only licensed for non-commercial or academic use. Note that these licenses are self-reported and their applicability is complicated, requiring legal consultation. The “O” column indicates if the collection includes OpenAI model generations, which may or may not affect commercial viability (see Section 4).

2.2 License Annotation Process

One of our central contributions is to validate the licenses associated with widely used and adopted datasets. This followed a time-intensive human annotation protocol to collect dataset authors’ self-reported licenses and categorize them according to their stated conditions. Note that this protocol reflects best efforts to verify self-reported licenses, and does not constitute legal advice (see Section 4). Additionally, it is important to note that the enforceability of these licenses depends on several factors discussed in Section 4. One especially important assumption in cases where datasets are based on data obtained from other sources is that dataset creators actually have a copyright interest in their dataset. This depends on the data source and how creators modify or augment this data, and requires a case-by-case analysis. However, it appears that most developers operate under the general assumption that they alone own their datasets. Our license annotation workflow follows these steps:

  1. Compile all Self-Reported License Information We aggregate all licensing information reported on GitHub, ArXiv, Hugging Face, Papers with Code, and the collection itself (e.g. Super-Natural Instructions, Wang et al. (2022c)).
  2. Search for explicit Data Licenses The annotator searches for a license specifically given to the dataset (not the accompanying code) by the authors. A license is found if (a) the GitHub repository mentions or links a license in reference to the data, (b) the Hugging Face license label was uploaded by the dataset creator themselves, (c) the paper, Hugging Face, or Papers with Code provide a dataset-specific license link, attributable to the data authors.
  3. Identify a License Type A license may fall into a set of common types (e.g. MIT, Apache 2, CC BY SA, etc.), be a “Custom” license, a permission Request Form, or if none was found for the data, Unspecified. If a dataset has multiple licenses, the annotator will list each of them, according to their types.
  4. Categorize Licenses From the perspective of a machine learning practitioner, licensing is typically viewed through the lens of how it impacts the model lifecycle: does it impede or allow for training on the data, downstream use conditions, attributing, modifying or re-distributing it. Based on discussions with industry experts, we categorize licenses based on three important features that impact the model lifecycle: is data usage limited to academic or non-commercial purposes (Permitted Use), does the data source need to be attributed (Attribution), and do derivatives of the data need to be licensed under the same terms as the original (Share-Alike). If there are multiple licenses for a dataset, its categorization for each feature is chosen as the strictest across licenses (see the sketch after this list).
  5. Additional Provenance In practice, legal teams may wish to balance their risk tolerance with more nuanced criteria. For instance, they may be satisfied with using (more permissive) GitHub licenses, even when it is ambiguous whether these apply to the code or the data. They may also wish to include or exclude datasets based on whether these are already widely used in practice, where the original data was sourced from, and if the creator is a competitor. To supplement the above license categories, we also collect all this metadata for fine-grained selection and filtering.
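The strictest-wins logic of step 4 is easy to make concrete. The sketch below uses a small, hypothetical license-to-feature table rather than the paper's full taxonomy:

```python
# Least -> most strict ordering of permitted-use categories.
USE_ORDER = ["commercial", "unspecified", "non-commercial/academic-only"]

# Hypothetical feature table for a few common license types.
LICENSE_FEATURES = {
    "MIT":          {"use": "commercial", "attribution": True,  "share_alike": False},
    "CC-BY 4.0":    {"use": "commercial", "attribution": True,  "share_alike": False},
    "CC-BY-SA 4.0": {"use": "commercial", "attribution": True,  "share_alike": True},
    "CC-BY-NC 4.0": {"use": "non-commercial/academic-only",
                     "attribution": True, "share_alike": False},
    "Unspecified":  {"use": "unspecified", "attribution": False, "share_alike": False},
}

def categorize(license_types):
    feats = [LICENSE_FEATURES[lt] for lt in license_types]
    return {
        # The strictest permitted-use category wins ...
        "use": max((f["use"] for f in feats), key=USE_ORDER.index),
        # ... and any attribution / share-alike requirement propagates.
        "attribution": any(f["attribution"] for f in feats),
        "share_alike": any(f["share_alike"] for f in feats),
    }

print(categorize(["MIT", "CC-BY-NC 4.0"]))
# {'use': 'non-commercial/academic-only', 'attribution': True, 'share_alike': False}
```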

2.3 Data Provenance Card—A Data Bibliography

Prior work has stressed the importance of data documentation and attribution (Bender and Friedman, 2018; Bommasani et al., 2023a). In particular, Gebru et al. (2021)’s Datasheets breaks documentation down into motivation, composition, collection process, processing, uses, maintenance, and distribution. Similarly, Bender and Friedman (2018) ask for curation rationale, language variety, speaker demographic, annotator demographic, speech situation, and text characteristics, among others. However, when models train on many sources of data, even if each is rigorously documented across these fields (rarely the case), it is challenging to cleanly synthesize comprehensive and navigable documentation for the resulting bundle.

To make this process tractable with scale, we propose leveraging Symbolic Attribution, where our tools autogenerate a structured store of the provenance and attribution metadata, similar to a bibliography for data.†

Figure 2: We plot the distributions of licenses used in the DPCollection, a popular sample of the major supervised NLP datasets. We find a long tail of custom licenses, adopted from software for data. 73% of all licenses require attribution, and 33% share-alike, but the most popular are usually commercially permissive.

Our collected schema allows this store to succinctly capture the attribution (links to repositories, aggregator copies, papers, creators), provenance (text/machine sources, licenses), and compositional properties of the data (languages, tasks, text metrics, format, and time). This file of references and metadata, known as a Data Provenance Card, enables the comprehensive documentation proposed by prior work, while providing some advantages from its structure. First, the Data Provenance Card can be easily searched, sorted, filtered and analyzed, whereas Datasheets or Statements, designed for individual datasets, are meant to be manually read. Second, developers can efficiently assemble relevant information without losing any detail, by symbolically linking to the original datasets and their documentation. Third, as datasets are continually re-packaged and absorbed into newer and bigger collections, Data Provenance Cards are easily adaptable by simply appending or concatenating them together. Altogether, we hope this tooling enables and promotes the thorough documentation proposed in prior work (Bender and Friedman, 2018; Gebru et al., 2021; Mitchell et al., 2019; Pushkarna et al., 2022).
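Because the card is a structured store rather than free text, the "appending or concatenating" property reduces to a simple merge. A minimal sketch, assuming each card is a list of dicts keyed by dataset name:

```python
def merge_cards(*cards):
    """Re-packaging collections = concatenate their cards, de-duplicate by name."""
    merged = {}
    for card in cards:
        for entry in card:                  # entry: dict with at least a "name" key
            merged.setdefault(entry["name"], entry)
    return sorted(merged.values(), key=lambda e: e["name"])

card_a = [{"name": "squad", "license": "CC-BY-SA 4.0"}]
card_b = [{"name": "flan", "license": "Apache 2.0"},
          {"name": "squad", "license": "CC-BY-SA 4.0"}]
print(merge_cards(card_a, card_b))   # two unique entries, still searchable/sortable
```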

3 Empirical Analysis of Data Provenance

3.1 Licenses in the Wild

This work constitutes the first extensive study of empirical license use for Natural Language Processing datasets. In this section, we share the insights we have gathered from our large-scale annotation and categorization. There is an important assumption in this section: the OpenAI Terms of Use is a contract, not a license, which prohibits the development of competing models using its outputs. For simplicity, we treat this as a Non-Commercial license in our analysis, though this is disputed for third parties who did not generate the OpenAI data themselves and therefore may not be bound by their terms (see Section 4 for discussion). Given the intention of OpenAI not to facilitate competitive commercial uses, we follow their categorization for this analysis.

Frequency of license types Figure 2 shows the distribution of licenses. The most common licenses are CC-BY-SA 4.0 (15.7%), the OpenAI Terms of Use (12.3%), and CC-BY 4.0 (11.6%). While most licenses are common and recognizable, there is a long tail of variants with unique settings, as well as a large set of Custom licenses accounting for 9.6% of all recorded licenses on their own. This wide license diversity illustrates the challenge for startups and less-resourced organizations attempting to navigate responsible training data collection and its legality and ethics.

Table 2: The distribution of license use categories shows our licenses have far fewer “Unspecified” omissions than GitHub (72%), Hugging Face (69%), and Papers with Code (70%), categorizing licenses more confidently into commercial or non-commercial categories. GitHub, Hugging Face, and Papers with Code match our licenses (green regions) 43%, 35%, and 54% of the time, respectively, and suggest incorrect licenses that are too permissive 29%, 27%, and 16% of the time.

Distribution of Restrictive Licenses In total, 85% of dataset licenses request attribution, and 30% include a share-alike clause.‡ Datasets which request attribution pose challenges for practitioners who commonly train on hundreds of datasets and either don’t cite them at all (OpenAI, 2023; Anil et al., 2023; Touvron et al., 2023) or simply cite an aggregation of data, which often falls short of the license’s conditions of attributing the specific repository or paper. Furthermore, “share-alike” clauses pose challenges for practitioners re-packaging data collections, usually with multiple conflicting share-alike licenses and without a clear way to resolve them (like Longpre et al. (2023a); Wang et al. (2022c) and others in the DPCollection). Frequently, practitioners will over-write share-alike licenses with more restrictive or even less restrictive conditions.

Missing or Unspecified Licenses. Next, we compare our manually reviewed licensing terms to the licenses for the same datasets as documented by the aggregators GitHub, Hugging Face, and Papers with Code. Table 2 shows that these crowdsourced aggregators have an extremely high proportion of missing (“Unspecified”) licenses, ranging from 69-72%, as compared to our protocol which yields only 30% “Unspecified”. The problem with “Unspecified” licenses is that it is unclear whether this is due to a shortcoming of the aggregator or because creators intentionally released the data without a license. Consequently, risk-averse developers are forced to avoid many valuable datasets, which they would otherwise use if they were given assurance that there is indeed no license. As part of DPCollection, we manually reassign 46-65% of dataset licenses (depending on the platform), resulting in much higher coverage, thus giving risk-averse developers more confidence and breadth in their dataset utilization.

Incorrectly Specified Licenses. Table 2 also shows that the licenses we assign are frequently stricter than those assigned by aggregators. GitHub, Hugging Face and Papers with Code each label license use cases too permissively in 29%, 27%, and 16% of cases respectively. Our inspection suggests this is due to contributors on these platforms often mistaking licenses attached to code in GitHub repositories for licenses attached to data.
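The Table 2 comparison can be framed as a small computation over two label maps. The sketch below is illustrative, assuming a strictness ordering over the use categories defined above:

```python
STRICTNESS = {"commercial": 0, "unspecified": 1, "non-commercial/academic-only": 2}

def compare_to_aggregator(annotated, aggregator):
    """annotated / aggregator: dicts mapping dataset name -> use category."""
    match = too_permissive = 0
    for name, true_cat in annotated.items():
        agg_cat = aggregator.get(name, "unspecified")  # missing label = Unspecified
        if agg_cat == true_cat:
            match += 1
        elif STRICTNESS[agg_cat] < STRICTNESS[true_cat]:
            too_permissive += 1                        # aggregator too permissive
    n = len(annotated) or 1
    return {"match": match / n, "too_permissive": too_permissive / n}
```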

Table 3: The mean number of features (e.g. tasks or languages) per dataset, and the mean entropy of the distribution, representing the diversity of categories. Non-Commercial / Academic-Only datasets have consistently and statistically higher task, topic, and source variety than Commercial datasets. We use Normalized Shannon Entropy for discrete features, and Differential Entropy for continuous features, which are both measures of randomness.

3.2 How does Data Availability Differ by License Use Category?

While non-commercial and academic-only licenses play important roles in protecting data use, their presence can also exclude communities from participating (or competing) in the development of these technologies. In this section, we break down datasets according to their license restrictions and see how they differ. Specifically, we ask: Does complying with licenses dictate systematic differences in resources for commercially-permissive (“open”) and non-commercial (“closed”) development? And what particular features of data are particularly constrained by non-commercial prohibitions?

We compare datasets by categories of permitted use, according to their licenses: (1) Commercially viable, (2) Non-Commercial/Academic-Only (NC/A-O), or (3) Unspecified license. We group together Non-Commercial and Academic-Only conditions, as the distinction will rarely matter for developers. We argue in Section 4 that datasets without any license (Unspecified) have not imposed any conditions, so can often be treated as commercially viable, but this may depend on a developer’s risk tolerance and jurisdiction.

Non-Commercial & Academic-Only Licensed Datasets have statistically greater diversity in their representation of tasks, topics, sources, and target text lengths. For each of these features, Table 3 illustrates the mean number per dataset, broken down by license category, together with the entropy measuring the randomness, and thus diversity, of each feature. NC/A-O datasets see greater diversity of tasks, topics, and sources represented in the text than commercial datasets. Figure 4 shows where this diversity comes from. The task categories most often NC/A-O include Brainstorming, Explanation, Logic & Math, as well as Creativity and Creative Writing. In comparison, the most commercially viable task categories are Short Text Generation, Translation, and Classification. Similarly, among Source Domains, Governments and Search Queries are largely viable for commercial (and unspecified) purposes, whereas General Web, Exams, and Model-generated sources are among the most restrictive.
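The diversity measures behind Table 3 can be sketched as follows; the paper does not spell out its exact estimators, so this assumes the standard normalized Shannon entropy for discrete features and a Gaussian-form differential entropy for continuous ones:

```python
import math
from collections import Counter

def normalized_shannon_entropy(values):
    """Entropy of a discrete feature, rescaled to [0, 1] by log(#categories)."""
    counts = Counter(values)
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    k = len(counts)
    return h / math.log(k) if k > 1 else 0.0

def gaussian_differential_entropy(xs):
    """Differential entropy under a Gaussian assumption: 0.5 * ln(2*pi*e*var)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)
    return 0.5 * math.log(2 * math.pi * math.e * var)

# Toy examples: task labels per dataset, and target text lengths in characters.
print(normalized_shannon_entropy(["qa", "qa", "translation", "summarization"]))
print(gaussian_differential_entropy([103, 677, 254, 98, 512]))
```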

Target Text Lengths are significantly higher for NC/A-O datasets than commercial datasets. Not only do NC/A-O datasets appear more textually and functionally diverse, their length characteristics differ substantially. While Table 3 shows the input text lengths across license categories are similar on average, the target text lengths are significantly higher for NC/A-O datasets (103 vs 677). This breakdown is further illustrated in Figure 5, where we see greater representation of both NC/A-O and synthetic datasets above the 100 target token threshold (y-axis).

The rise of synthetic datasets generated using APIs with non-commercial terms of use may explain the differences in text diversity and length. Table 3 also shows a full 45% of NC/A-O datasets are synthetic, as compared to < 14% in more permissive license categories. Taori et al. (2023); Wang et al. (2022a); Xu et al. (2023a) and their variants, all generated in part using commercial APIs, exhibit stronger task and topic diversity than traditional academic datasets, as they cater to longer form generations, by design.

2023 has a large spike in license usage, and in NC/A-O licensed data, representing 61%, as compared to 20% on average in prior years. Among the large collection of datasets we trace, we record the date at which they were released, by cross-referencing their associated GitHub, ArXiv, and Hugging Face dates. We find a striking change in the pattern of licensing restrictions. As shown in Figure 3, prior to 2023, no year saw greater than 1/3 of the datasets released as NC/A-O. However, in 2023, which includes many of the most popular and diverse datasets, the NC/A-O rate is 61%. Furthermore, most datasets were unaccompanied by a license prior to 2022 (~50-80%), as compared to only 12% in 2023. The shift to more license use, and to more restrictively conditioned data releases, may foretell future challenges to open data, if the trend continues.

Figure 3: The distribution of datasets in each time of collection (top) and language family (bottom) category, with total count above the bars, and the portion in each license use category shown via bar color. Red is Non-commercial/Academic-Only, Yellow is Unspecified, and Blue is Commercial. Lower resource languages, and datasets created in 2023, see a spike in non-commercial licensing.

Figure 4: The distribution of datasets in each Domain Source (top) and task (bottom) category, with total count above the bars, and the portion in each license use category shown via bar color. Red is Non-commercial/Academic-Only, Yellow is Unspecified, and Blue is Commercial. Creative, reasoning, and long-form generation tasks, as well as datasets sourced from models, exams, and the general web, see the highest rate of non-commercial licensing.

Figure 5: Across finetuning datasets, we visualize their mean input (x-axis) and target (y-axis) text lengths, measured in log-scaled number of words. The colors indicate either their license use category (left) or whether they were machine generated or human collected (right). Long target texts are represented in large part by Non-Commercial and Synthetic datasets, that are often generated by commercial APIs.

Commercial datasets have greater language variety, but low-resource language datasets see the least commercial coverage. Table 3 shows that commercial datasets actually have greater diversity of languages than NC/A-O. However, when broken down by language family, as in Figure 3, we see stark differences in permitted use by group. Code language datasets are nearly all commercially viable (78%), because dataset creators can easily filter GitHub for permissively licensed repositories. Interestingly, English, Atlantic-Congo, and Afroasiatic languages also see large permissive representation. However, Turkic, Sino-Tibetan, Japonic, and Indo-European languages see in excess of 35% as non-commercial. Note that while the Indo-European language family contains many high-resource European languages, there is a long tail of lower-resource ones. These NC/A-O language families provide directions for open data practitioners to focus their future efforts.

3.3 Broader Characteristics of the Data

In addition to understanding systematic differences in the data by license, there are research questions regarding the overall composition and characteristics of these widely used and adopted datasets. Our compilation of metadata through the DPCollection allows us to map the landscape of data characteristics, and inspect particular features. Note that all these details are also available with interactive visualizations at www.comingsoon.com, for further research and examination.

https://openai.com/policies/terms-of-use

Figure 6: A global heatmap measuring how well each country’s spoken languages are represented by the composition of natural language datasets in DPCollection, as calculated by Section 3.3. English-speaking and Western European nations are best represented, while the Global South sees limited coverage.

Language representation is heavily skewed to English and Western European Languages. Following Talat et al. (2022)’s recommendations in data transparency and documentation in demographic analysis, and corroborating Kreutzer et al. (2022)’s similar analysis for pretraining corpora, we find a stark Western-centric skew in representation. Figure 6 illustrates the coverage per country according to the spoken languages and their representation in DPCollection. We compute a Language Representation score \(S_k\) for each country \(k\), parametrized by \(p_{kl}\), the percentage of people in country \(k\) that speak language \(l\), and \(w_{li}\) which is a binary indicator that is 1 if dataset \(i \in D\) contains language \(l\) and 0 otherwise.

\[S_k = \sum_{l \in L} \left( p_{kl} \times \sum_{i \in D} w_{li} \right)\]

In the formula above, \(S_k\) is the language representation score for country \(k\), and \(p_{kl}\) is the share of the population of country \(k\) that speaks language \(l\); \(w_{li}\) is a binary indicator equal to 1 if dataset \(i\) contains language \(l\) and 0 otherwise.
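A direct transcription of the formula, with toy numbers rather than the paper's population and dataset data:

```python
def language_representation(p_k, datasets):
    """S_k = sum over languages l of p_kl * (number of datasets containing l).

    p_k: {language: population share in country k}
    datasets: iterable of sets of language codes (one set per dataset)
    """
    return sum(share * sum(1 for langs in datasets if lang in langs)
               for lang, share in p_k.items())

datasets = [{"en", "fr"}, {"en"}, {"sw"}]
print(language_representation({"en": 0.9, "hi": 0.1}, datasets))  # 0.9*2 + 0.1*0 = 1.8
```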

The distribution visualized in Figure 6 shows that Asian, African, and South American nations are sparsely covered if at all. Even when nations from the Global South appear to have linguistic representation, according to Section 3.3, the text source and dialect of the language contained in these datasets almost always originates from North American or European creators and web sources (though this is difficult to measure precisely). These observations corroborate similar findings in the geo-diversity of image data in the vision domain (Shankar et al., 2017; De Vries et al., 2019; Mahadev and Chakravarti, 2021). The resulting models trained on these datasets are likely to have inherent bias, underperforming in critical ways for users of models outside of the west (Ahia et al., 2021).

The primary drivers of dataset curation are academic organizations, supplying 69% of datasets, followed by industry labs (21%) and research institutions (17%). These metrics describe the scale of dataset curation contributions, but not the influence each dataset has had on the community. Table 4a shows the single largest dataset contributors are AI2 (12.3%), University of Washington (8.9%), and Facebook AI Research (8.4%). It is important to note that these contributors often only download and compile text from the Internet that was originally written by other people.

Table 4: A summary of the distribution of Creators, Topics, and Source Domains across all 1800+ datasets. Datasets can have multiple creators, text topics, and sources.

Text datasets focus on topics of Language & Linguistics, General Knowledge, Logic, & Lifestyle. Prior data collection work focuses predominantly on describing datasets by their task compositions (Sanh et al., 2021; Wang et al., 2022a; Longpre et al., 2023a), but rarely by their actual topics (except Gao et al. (2020) in their Appendix). Table 4b shows the most popular topics, clustered by category, with their representation across datasets. Like most NLP tasks, much of this text data focuses on communication and language understanding topics, followed closely by general knowledge, routine, sports, and education.

Text datasets are sourced primarily from Online Encyclopedias (22%), Social Media (16%), the General Web (11%), News (11%), and Entertainment web resources (9%). While practitioners document their individual dataset sources in their published papers, this information is unstructured and can be hard to find. As a result, massive collections of widely used datasets rarely compile the distribution of their original sources, instead just citing the papers. After a series of dataset compilations and re-packagings, the original sources are often lost or not well known. By manually scanning approximately 500 academic papers, our volunteers annotated the original text sources and compiled them into domain clusters, to permit attribution and analysis, as summarized in Table 4c. Among the most adopted individual sources are wikipedia.org (14.9%), undisclosed webpage scrapes (7.0%), Reddit (6.2%), and Twitter (4.0%). The least represented domains are Commerce, Reviews, Legal, Academic Papers, and Search Queries, among others.

Our empirical analysis highlights that we are in the midst of a crisis in dataset provenance and practitioners are forced to make decisions based on limited information and opaque legal frameworks. While we believe our tooling will enable better transparency about where licenses are in tension, major legal ambiguities remain in data licensing.

4 Legal Discussion

Background Copyright laws aim to encourage written and artistic expression by giving authors exclusive rights to copy, distribute, and adapt their work (Patterson, 2003; Burger, 1988). Open-source licenses first emerged as legal tools to encourage collaboration around software development (Von Krogh and Von Hippel, 2003). A range of licenses with different terms and purposes exists, including the MIT License, Creative Commons Licenses, and the Apache License, as well as the newer Responsible AI License (RAIL) and AI2 ImpACT Licenses.¶ The interplay between copyright and licenses can be understood in the following way: copyright automatically gives creators exclusive rights in their works, and creators assign these rights to others through license agreements. As we will explore, the open-source licenses that emerged in the last three decades are not always well-equipped to handle the unique characteristics of data, and especially supervised AI training data. Meanwhile, it remains unclear how relevant laws, including those related to copyright and fair use, should be applied to the unique challenges raised by Generative AI and supervised datasets (Lee et al., 2023). In this section, we highlight some of the key legal challenges and ambiguities related to supervised datasets.

Lifecycle of a dataset We focus on supervised datasets, which we define as datasets that are created for machine learning (mainly for finetuning and alignment) and where dataset creators made copyrightable contributions in the form of annotations or compilations. A typical supervised dataset is the result of a process that involves several stages of scraping (or machine generation) and annotation by different entities. Generally, raw data is created by people interacting with internet platforms, such as individuals writing articles, sharing artworks, or engaging in online discussion forums. The copyrights to this raw data are normally held by individual users (e.g. Reddit) or by the platform (e.g. Amazon Reviews). Much of this data has been scraped to construct unsupervised datasets for machine learning and this use is commonly justified on the basis of fair use or data mining exceptions to copyright (Henderson et al., 2023; Sobel, 2017; Lee et al., 2023; Samuelson, 2023; Lemley and Casey, 2020). However, we find that many common supervised datasets are generated by annotating small samples of scraped raw data using human annotators or large language models. The annotated data is then published with a license agreement. In stark contrast to the copyrighted content that is scraped from the web, supervised datasets were created for the sole purpose of furthering machine learning. The focus of the legal discussion in this section is on how supervised dataset creators can constrain the usage of the copyrightable content they create through licenses and other legal mechanisms. Though we do not address them here, there are several important related questions on the use of copyrighted works to create supervised datasets and on the copyrightability of training datasets.

See https://www.licenses.ai/blog/2023/3/3/ai-pubs-rail-licenses and https://allenai.org/impact-license#licenses. These license templates propose terms aimed at encouraging more responsible or risk-based machine learning practices; see also Contractor et al. (2022).

Supervised Dataset Example: SQuAD

Rajpurkar et al. (2016) present a prototypical supervised dataset on reading comprehension. To create the dataset, the authors take paragraph-long excerpts from 539 popular Wikipedia articles and hire crowdsourced workers to generate over 100,000 questions whose answers are contained in the excerpts.

For example:

Wikipedia Excerpt: In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity.
Worker-generated question: What causes precipitation to fall?
Answer: Gravity

Here the authors use Wikipedia text as a basis for their data and their dataset contains 100,000 new question-answer pairs based on these texts.

Copyright laws vary by jurisdiction and are subjective, so it is challenging to develop technical safeguards that guarantee compliance. The legal analysis surrounding supervised datasets is complicated by the lack of a uniform global legal framework to address copyright concerns. Different jurisdictions have different and evolving laws. Therefore, the location of model developers and training data creators, as well as where and when data was collected, may influence the legal analysis. For example, the United States has a fair-use exception to copyright that allows the limited use of copyrighted material under certain circumstances without requiring permission from the rights holders (17 U.S.C. §107). The EU has no fair-use provision but does have an explicit copyright exception to allow data mining under certain conditions, like obtaining lawful access to the data (Margoni and Kretschmer, 2022). Meanwhile, datasets themselves generally enjoy copyright protection in the U.S. (Lee et al., 2023), while the E.U. recently created a unique set of rights for dataset creators with the purpose of incentivizing research and development related to databases (Derclaye and Husovec, 2022). In addition to differences across jurisdictions, there are also several international agreements related to copyright (Ricketson and Ginsburg, 2022). Ultimately, it can be challenging to determine which laws should apply to a given machine learning project when the relevant rules vary between the locations where the data was scraped and annotated, where it was downloaded, where the model was trained, and where the model was deployed.

While geographical disparities in regulatory frameworks present one set of challenges, the subjectivity inherent in determining whether copyright infringement has occurred makes it even more challenging to design technical safeguards. For example, in the U.S. part of the copyright infringement analysis depends on whether two works are subjectively similar from the perspective of an ordinary person (Mohler, 1999; Cohen, 1986; Balganesh et al., 2014). This is a subjective standard and existing case law may be challenging to extend to generative AI outputs. As a result, while there are technical strategies that can reduce the risk of infringement (Henderson et al., 2023; Sag, 2023; Vyas et al., 2023), it will be difficult for developers to create technical safeguards that eliminate this risk entirely.

Open legal questions regarding copyright and model training. Apart from these jurisdictional and interpretive ambiguities, the process of training a model raises specific copyright questions (Epstein et al., 2023). Training a model poses several interesting legal questions with respect to copyright, and infringement may occur in several ways even before any outputs are generated.

First, the act of creating a training dataset by scraping existing works involves making a digital copy of the underlying data. As the name implies, copyright gives the author of a protected work the exclusive right to make copies of that work. If the scraped data is protected by copyright, then creating training data corpora may raise copyright issues (Quang, 2021). Second, copyright holders generally have an exclusive right to create derivative works (e.g., translations of a work) but it is not clear whether a trained machine learning model should be considered a derivative of the training data (Lee et al., 2023). If models are considered to be derivative works, then training a model would be more likely to violate the rights of the training data’s copyright holders (Gervais, 2021).

In the U.S., the fair use exception may allow models to be trained on protected works (Henderson et al., 2023; Lemley and Casey, 2020; Sobel, 2017; Samuelson, 2023). As these authors explain, the training of machine learning models on copyrighted content may be permissible if the underlying works are significantly “transformed” into model weights, only a small amount of each work in the training data is included in the trained model, model training is designed to only glean generalizable insights from the training data, and the trained model does not have a strong effect on the economic success of the works in the training data. It is important to underscore that, while training a machine learning model itself may be protected by fair use, this does not mean that model outputs will not infringe on the copyright of prior works. As the authors above highlight, the application of fair use in this context is still evolving, and several of these issues are currently being litigated (see e.g., Andersen v. Stability, Doe v. GitHub, and Tremblay v. OpenAI).

Fair use is less likely to apply when works are created for the sole purpose of training machine learning models, as in the case of supervised datasets with copyrightable compositions or annotations. The prior literature on fair use and machine learning tends to focus on copyrighted art or text that was scraped to train a model. These scraped works were not created for the purpose of training machine learning models. By contrast, in this paper, we focus on supervised datasets that were created for the sole purpose of training machine learning models. As underscored by Henderson et al. (2023) and Sobel (2017), the fair use analysis depends in part on whether a trained model copies the “expressive purpose” of the original work. While the expressive purpose of a piece of text or art is not to train machine learning models, the purpose of a training dataset is to do just that. As a result, we expect that it is less likely that fair use would apply to the use of curated data. Instead, the creators of these datasets hold a copyright in the dataset‖ and the terms of the dataset license agreement govern the subsequent use of this data. However, it is rare in practice for an LLM to use a single supervised dataset, and often multiple datasets are compiled into collections. This further complicates the legal analysis because we find that the license terms of many popular dataset collections are conflicting.

Licenses used for datasets are often ill-suited for this purpose. Beyond the intricate interplay between training data and fair use, the frequently misapplied licensing frameworks for datasets present another set of complications. Most open-source licenses were designed for software, yet we find them attached to datasets, which creates challenges (Meeker, 2022). One of these challenges is that licenses like the Apache and Creative Commons licenses outline restrictions related to “derivative” or “adapted works”, but it remains unclear if a trained model should be classified as a derivative work. This issue is further exacerbated when multiple datasets, each potentially governed by a different open-source license, are amalgamated into collections. If the requirements of the underlying license agreements are irreconcilable, such as different copyleft requirements, this makes it extremely hard for developers to use certain collections while respecting all license terms. To remedy these issues, new licenses are being proposed to address the needs of machine learning datasets, such as the BigScience Responsible AI License or an adaptation of the MIT License that requires additional permissions for model training, proposed by Ioannidis et al. (2023). Despite these new proposals, we find that the majority of datasets are licensed under conventional open-source licenses.

Data ownership and data copyright are complex topics (Ginsburg, 1992). We assume that the creators of supervised datasets have some form of copyright in their dataset, though there is often content in these datasets that is owned by third parties. If they satisfy the requirements for copyrightability, dataset creators would have a copyright interest in any new content they create (e.g. annotations). In the U.S., datasets themselves may also be copyrightable as compilations (Lee et al., 2023) while the E.U. provides so-called sui generis rights for databases (Derclaye and Husovec, 2022).

LLM-generated annotations raise additional legal considerations We find that approximately 12% of the datasets we audit were annotated using OpenAI. The OpenAI Terms of Use state that outputs from the OpenAI service may not be used “to develop models that compete with OpenAI”∗∗. These terms seem to preclude a developer from using OpenAI to generate training data to train a competing LLM. However, it is not clear whether they would also limit the ability of a developer to use OpenAI to create and publish an annotated dataset. On the one hand, publishing such a dataset does not directly compete with OpenAI. On the other hand, it seems foreseeable that such a dataset could enable third parties (who did not themselves use OpenAI) to create competing LLMs. In the U.S., there are several doctrines of secondary or indirect copyright liability aimed at enforcing copyright in cases where there is no direct infringement (Grossman, 2005; Lee et al., 2023). The application of these doctrines depends on many factors, most importantly on whether OpenAI has a copyright interest in its outputs. If these copyright doctrines do not apply, then it is still possible that publishing the dataset constitutes a breach of contract by the dataset developers. While it would be more challenging for OpenAI to pursue a case against third parties, there are myriad other business torts, from unfair competition to misappropriation, that may be relevant to this situation, and which go beyond the scope of this paper (Marks and Moll, 2023). Time will tell the extent to which OpenAI and other LLM service providers can enforce their terms of use against third parties. However, a prominent researcher at Google has already resigned citing concerns that OpenAI outputs were used to train BARD (Victor and Efrati, 2023). In light of these legal ambiguities, our tool gives developers the ability to exclude OpenAI-generated datasets.
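The exclusion switch mentioned here might look like the following sketch, assuming a hypothetical metadata field recording which API generated each dataset:

```python
def select_datasets(records, exclude_openai=True):
    """Drop records whose text was generated with OpenAI APIs."""
    return [r for r in records
            if not (exclude_openai and "openai" in r.get("generated_by", "").lower())]

corpus = [{"name": "alpaca-style", "generated_by": "OpenAI text-davinci-003"},
          {"name": "human-qa", "generated_by": ""}]   # toy records
print([r["name"] for r in select_datasets(corpus)])   # ['human-qa']
```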

While legal issues remain ambiguous, practitioners are making decisions on data use and model training. In the face of these pervasive legal uncertainties, practitioners’ decisions regarding data usage are ultimately guided by a blend of factors including the specific licensing terms, the origin of datasets, and the degree of usage of a given dataset by others. Navigating this landscape requires striking a delicate balance between risk mitigation and the need for sufficient resources. This equation, however, varies across regions, applications, and corporate environments, influenced by factors such as competition, risk, and regional legislation. A strategy for partially mitigating these uncertainties is for model providers to indemnify users, as done by Google Cloud (Suggs and Venables, 2023). However, this may not be feasible for resource-constrained developers and, while it protects end-users, it does not solve the issues faced by model developers or dataset curators.

Our Approach. The fundamental purpose of copyright is to encourage creativity and innovation. As we highlighted in the sections above, the current legal landscape remains ambiguous, and this lack of clarity can stifle innovation as developers fear legal repercussions. Through our audit and tooling, we seek to provide important information for practitioners to make informed decisions in an otherwise ambiguous landscape, guided by their own legal interpretation and risk tolerance. This information includes data license lineages, a categorization of license terms, details on data creators, and the underlying data sources (e.g. web or LLM). In light of ongoing litigation and a lack of legal certainty, we attempted to give developers the information to make these decisions for themselves. In creating a repository of data licensing information, we are also taking a step towards encouraging dataset creators to be more thoughtful about the licenses that they select. Dataset creators are well-positioned to understand the appropriate uses of the datasets they publish, and licenses can be a tool to communicate these restrictions and to encourage responsible AI development. We further aim to highlight that machine learning practitioners should take dataset license terms seriously, as they may have real impacts on how their models may be used in practice. Ultimately, thoughtful data licensing could be leveraged to promote more responsible, inclusive, and transparent machine learning practices.

NOTICE: Collected License Information is NOT Legal Advice. It is important to note we collect self-reported licenses and categorize them according to our best efforts, as a volunteer research and transparency initiative. The information provided by any of our works and any outputs of the Data Provenance Initiative do not, and are not intended to, constitute legal advice; instead, all information, content, and materials are for general informational purposes only. Readers and users should seek their own legal advice from counsel in their relevant jurisdiction.

5 Related Work

Data Documentation A long line of work has highlighted the importance of data and its documentation in natural language processing (Paullada et al., 2021; Rogers, 2021; Meyer et al., 2023; Gururangan et al., 2018; Muennighoff et al., 2023b). In particular, these works stress the challenges posed by poor documentation to reproducibility, good science, and generally well-understood model behavior (Sambasivan et al., 2021a; Bandy and Vincent, 2021; Longpre et al., 2023b). Recent work has also explored the importance of documenting AI ecosystems (Bommasani et al., 2023b) and the supply chain from data to models (Cen et al., 2023).

Data Analysis and Exploration Several notable works have conducted large-scale analyses into data, particularly pretraining text corpora (Gao et al., 2020; Dodge et al., 2021; Kreutzer et al., 2022; Laurençon et al., 2022; Scao et al., 2022a,b; McMillan-Major et al., 2022). Other works have investigated the geo-diversity of vision-based datasets (Shankar et al., 2017; De Vries et al., 2019; Mahadev and Chakravarti, 2021). Different forms of data governance have been proposed to centralize responsibility and documentation over datasets, including for the BigScience project (Jernite et al., 2022) and a Public Data Trust (Chan et al., 2023). In terms of finding and visualizing datasets, a few recent tools have been proposed (Färber and Leisinger, 2021; Viswanathan et al., 2023).

Transparency and accountability Adjacent to the realm of legality, prior works have strongly advocated and provided frameworks for documentation and audits to increase transparency and accountability in AI systems (Miceli et al., 2022; Kapoor et al., 2023; Raji and Buolamwini, 2022). In a manner akin to DPI, which draws upon the collective knowledge of legal and machine learning experts, earlier research has also underscored the significance of interdisciplinary collaborations (Hutchinson et al., 2021). Datasheets for Datasets (Gebru et al., 2021) and Data Statements (Bender and Friedman, 2018) both provide structured frameworks for revealing essential metadata such as a dataset's motivation and intended use. Pushkarna et al. (2022) expanded on datasheets with “Data Cards” covering sources, collection, ethics, and adoption.

Similarly, Mitchell et al. (2019) introduced model cards to benchmark model performance across demographic groups and disclose evaluation procedures. Crisan et al. (2022) proposed interactive model cards as an alternative mode of documentation and metadata sharing. Complementary to transparency regarding the dataset’s creation process, Corry et al. (2021) provide a framework that guides users on how to navigate datasets as they approach the end of their life-cycle. DPI builds upon the foundational frameworks laid out in these earlier studies, with a specific focus on addressing the licensing aspects of dataset curation. Our goal is to equip users with a comprehensive understanding of the legal risks associated with dataset usage.

Dataset legality The legality of the datasets used to train large base models has recently received significant attention (Sag, 2020; Henderson et al., 2023). The challenge of determining the legality of employing different datasets becomes particularly complex due to the intricate nature of dataset creation processes. Lee et al. (2023) break up the stages of dataset creation and model generation and assess the relevant copyright questions in the US legal system. These processes often involve multiple licenses and restrictions that can interact in ways that obscure the final legal risk. Soh (2021) proposes a high-level framework for pinpointing the areas within dataset creation and usage where legal analysis is necessary, but does not apply this framework to any existing datasets. Min et al. (2023) demonstrate that refraining from training on copyrighted or highly restricted datasets has a detrimental impact on downstream performance. Their proposed solution involves using a language model trained on “low-risk” text and augmenting it with a data-store containing “high-risk” text, which can be modified appropriately as the legal landscape clarifies over time. DPI enhances these investigations by involving legal experts in the development of a framework for assessing a dataset’s “risk” and annotating the “risk” associated with numerous existing high-profile datasets.

Previous: Weight Alignment Tuning Google - WARP | Next: Model | Llemma Pile-2
