
Model | DOLLY

  • Related Project: private
  • Category: Paper Review
  • Date: 2023-08-09

TL;DR


  • Dataset construction and training: secure a high-quality dataset and train it efficiently to build a high-performing LLM.
  • Mathematical formulation: clean the data and optimize the model using explicit mathematical criteria and algorithms.
  • Research and technical progress: advance the democratization of AI through open source releases and collaboration.

[Index: data quality]


[Prior Work]

Prior work has shown that progress in large language models (LLMs) is driven by both the quantity and the quality of their training data. Early models were trained on data with simple sentence structures, while recent work has moved toward datasets that include full documents in order to capture broader context. GPT-3, for example, reached high performance by training on hundreds of billions of tokens. Such models typically require enormous compute resources, and the literature repeatedly stresses that securing high-quality data is essential.

[Problem Statement]

LLMs to date have required massive amounts of data and high-end compute, creating both technical and economic barriers. In addition, existing datasets often contain low-quality data, which can degrade model performance.

[Proposed Solution]

This work proposes an approach for building a high-performing LLM at low cost: retrain an existing, inexpensive LLM on a high-quality human-generated dataset so that strong performance can be achieved even with a small dataset. The approach optimizes data quality while reducing the cost of model training.


[Method]

[Data Cleaning and Preparation]

  • Quality criterion: $ P_{filtered} = \{\, d \in D \mid f(d) > \theta \,\} $, where \(f(d)\) is a quality score for a record \(d\) and \(\theta\) is the acceptance threshold
  • Deduplication: $ D_{unique} = \bigcup_{i=1}^{n} \{\, d_i \in D \mid \nexists\, j < i : d_j \equiv d_i \,\} $

    • \(D_{unique}\) is the set that keeps only the unique elements
    • \(\bigcup\) denotes the union over a family of sets
    • \(i\) is an index running from 1 to \(n\), where \(n\) is the number of elements of \(D\)
    • \(d_i\) is an element of the set \(D\)
    • The condition \(\nexists\, j < i : d_j \equiv d_i\) states that no earlier index \(j\) holds an element identical to \(d_i\), guaranteeing that \(d_i\) has not already appeared at an earlier position

    The formula therefore keeps each element of \(D\) only at its first occurrence, collecting elements in first-appearance order to build the duplicate-free set \(D_{unique}\); a code sketch of both steps follows below.
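
A minimal sketch of the two cleaning steps above, assuming a hypothetical quality function `quality_score` standing in for \(f(d)\), a hypothetical threshold `THETA` for \(\theta\), and exact string equality for the \(d_j \equiv d_i\) check:

```python
# Minimal sketch of quality filtering and order-preserving deduplication.
# `quality_score` and THETA are hypothetical stand-ins for f(d) and θ;
# the actual filtering criteria are not specified in this post.
from typing import Callable, Iterable

THETA = 0.5  # assumed quality threshold θ


def quality_score(record: str) -> float:
    """Hypothetical quality function f(d), e.g. a simple length heuristic."""
    return min(len(record.split()) / 100.0, 1.0)


def filter_and_dedup(records: Iterable[str],
                     f: Callable[[str], float] = quality_score,
                     theta: float = THETA) -> list[str]:
    # P_filtered = { d in D | f(d) > theta }
    filtered = [d for d in records if f(d) > theta]
    # D_unique: keep each element only at its first occurrence
    # (the "no j < i with d_j ≡ d_i" condition), using exact string equality.
    seen: set[str] = set()
    unique: list[str] = []
    for d in filtered:
        if d not in seen:
            seen.add(d)
            unique.append(d)
    return unique
```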

[Index: data deduplication]


[Model Training and Optimization]

  • Base model: EleutherAI's open source 6-billion-parameter (6B) model
  • Fine-tuning: focused tuning on a dataset built to strengthen instruction-following ability (a training sketch follows below)
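
A hedged sketch of what such instruction fine-tuning can look like with Hugging Face Transformers. The base model id (EleutherAI/gpt-j-6b), the prompt template, and the hyperparameters are assumptions for illustration; the exact Dolly training recipe is published by Databricks and is not reproduced here.

```python
# Sketch only: instruction fine-tuning a 6B EleutherAI model on dolly-15k.
# Model choice, prompt template and hyperparameters are illustrative assumptions,
# not the exact Dolly recipe. Requires a large GPU (or a multi-GPU setup).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "EleutherAI/gpt-j-6b"  # assumed 6B base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

def to_text(example):
    # Concatenate instruction (+ optional context) and response into one string.
    ctx = f"\n{example['context']}" if example["context"] else ""
    return {"text": f"Instruction: {example['instruction']}{ctx}\nResponse: {example['response']}"}

ds = load_dataset("databricks/databricks-dolly-15k", split="train").map(to_text)
tokenized = ds.map(lambda e: tokenizer(e["text"], truncation=True, max_length=1024),
                   remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dolly-ft", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1, bf16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```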

[Dataset and Benchmark]

The databricks-dolly-15k dataset used in this work consists of 15,000 human-generated instruction/response pairs. It was designed to cover a wide range of instruction-following behaviors and is the first open source dataset of its kind licensed for commercial use. Fine-tuned on it, the Dolly model exhibited ChatGPT-like interactive capabilities.
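
The dataset is published on the Hugging Face Hub; a minimal sketch for loading and inspecting it is shown below (the dataset id and field names follow the public dataset card).

```python
# Load databricks-dolly-15k and inspect its size and fields.
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(len(dolly))            # roughly 15,000 instruction/response pairs
print(dolly.column_names)    # expected: ['instruction', 'context', 'response', 'category']
print(dolly[0]["instruction"])
```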


[Conclusion and Expected Impact]

This work presents a cost-effective way to build a high-performing LLM, underscoring the importance of high-quality data and the impact that effective data cleaning and processing have on model performance.


Model Blogs

[1] Hello Dolly: Democratizing the magic of ChatGPT with open models


  • Summary

    We show that anyone can take a dated off-the-shelf open source large language model (LLM) and give it magical ChatGPT-like instruction following ability by training it in 30 minutes on one machine, using high-quality training data. Surprisingly, instruction-following does not seem to require the latest or largest models: our model is only 6 billion parameters, compared to 175 billion for GPT-3. We open source the code for our model (Dolly) and show how it can be re-created on Databricks. We believe models like Dolly will help democratize LLMs, transforming them from something very few companies can afford into a commodity every company can own and customize to improve their products.

  • Background

    ChatGPT, a proprietary instruction-following model, was released in November 2022 and took the world by storm. The model was trained on trillions of words from the web, requiring massive numbers of GPUs to develop. This quickly led to Google and other companies releasing their own proprietary instruction-following models. In February 2023, Meta released the weights for a set of high-quality (but not instruction-following) language models called LLaMA to academic researchers, trained for over 80,000 GPU-hours each. Then, in March, Stanford built the Alpaca model, which was based on LLaMA, but tuned on a small dataset of 50,000 human-like questions and answers that, surprisingly, made it exhibit ChatGPT-like interactivity.

  • Introducing Dolly

    Today we are introducing Dolly, a cheap-to-build LLM that exhibits a surprising degree of the instruction following capabilities exhibited by ChatGPT. Whereas the work from the Alpaca team showed that state-of-the-art models could be coaxed into high quality instruction-following behavior, we find that even years-old open source models with much earlier architectures exhibit striking behaviors when fine-tuned on a small corpus of instruction training data. Dolly works by taking an existing open source 6 billion parameter model from EleutherAI and modifying it ever so slightly to elicit instruction following capabilities such as brainstorming and text generation not present in the original model, using data from Alpaca.

    The model underlying Dolly only has 6 billion parameters, compared to 175 billion in GPT-3, and is two years old, making it particularly surprising that it works so well. This suggests that much of the qualitative gains in state-of-the-art models like ChatGPT may owe to focused corpuses of instruction-following training data, rather than larger or better-tuned base models. We’re calling the model Dolly — after Dolly the sheep, the first cloned mammal — because it’s an open source clone of an Alpaca, inspired by a LLaMA. We’re in the earliest days of the democratization of AI for the enterprise, and much work remains to be done, but we believe the technology underlying Dolly represents an exciting new opportunity for companies that want to cheaply build their own instruction-following models.

    We evaluated Dolly on the instruction-following capabilities described in the InstructGPT paper that ChatGPT is based on and found that it exhibits many of the same qualitative capabilities, including text generation, brainstorming, and open Q&A. Of particular note in these examples is not the quality of the generated text, but rather the vast improvement in instruction-following capability that results from fine-tuning a years-old open source model on a small, high-quality dataset.


[2] Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM

Two weeks ago, we released Dolly, a large language model (LLM) trained for less than $30 to exhibit ChatGPT-like human interactivity (aka instruction-following). Today, we’re releasing Dolly 2.0, the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.

Dolly 2.0 is a 12B parameter language model based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality human generated instruction following dataset, crowdsourced among Databricks employees.

We are open-sourcing the entirety of Dolly 2.0, including the training code, the dataset, and the model weights, all suitable for commercial use. This means that any organization can create, own, and customize powerful LLMs that can talk to people, without paying for API access or sharing data with third parties.
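
As an illustration of that kind of ownership, here is a hedged usage sketch of the released Dolly 2.0 weights with Hugging Face Transformers; the model id databricks/dolly-v2-12b and the trust_remote_code pipeline follow the public model card, while the prompt and generation settings are assumptions.

```python
# Sketch: text generation with the released dolly-v2-12b weights.
# Needs a GPU with enough memory; settings follow the public model card.
import torch
from transformers import pipeline

generate = pipeline(
    "text-generation",
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # the model card ships a custom instruction-following pipeline
    device_map="auto",
)
print(generate("Explain what instruction tuning is in two sentences."))
```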

  • Databricks dolly 15k dataset

    databricks-dolly-15k contains 15,000 high-quality human-generated prompt / response pairs specifically designed for instruction tuning large language models. Under the licensing terms for databricks-dolly-15k (Creative Commons Attribution-ShareAlike 3.0 Unported License), anyone can use, modify, or extend this dataset for any purpose, including commercial applications.

    To the best of our knowledge, this dataset is the first open source, human-generated instruction dataset specifically designed to make large language models exhibit the magical interactivity of ChatGPT. databricks-dolly-15k was authored by more than 5,000 Databricks employees during March and April of 2023. These training records are natural, expressive and designed to represent a wide range of the behaviors, from brainstorming and content generation to information extraction and summarization.
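
For reference, each record in the dataset follows a simple fielded structure; the record below is made up for illustration and is not taken from the dataset.

```python
# Illustrative (made-up) record showing the documented fields of databricks-dolly-15k.
example_record = {
    "instruction": "When was the Eiffel Tower completed?",
    "context": "",                 # optional reference passage; empty for open QA
    "response": "The Eiffel Tower was completed in 1889.",
    "category": "open_qa",         # behavior category label (assumed value for this example)
}
```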

  • Why did we create a new dataset?

    As soon as we released Dolly 1.0, we were inundated by requests from people who wanted to try it out. The number one question that we kept getting was “can I use this commercially?”

    A critical step in the creation of Dolly 1.0, or any instruction following LLMs, is to train the model on a dataset of instruction and response pairs. Dolly 1.0 was trained for $30 using a dataset that the Stanford Alpaca team had created using the OpenAI API. That dataset contained output from ChatGPT, and as the Stanford team pointed out, the terms of service seek to prevent anyone from creating a model that competes with OpenAI. So, unfortunately, the answer to this common question was, “probably not!”

    As far as we know, all the existing well-known instruction-following models (Alpaca, Koala, GPT4All, Vicuna) suffer from this limitation, prohibiting commercial use. To get around this conundrum, we started looking for ways to create a new dataset not “tainted” for commercial use.

  • How did we do it?

    We knew from the OpenAI research paper that the original InstructGPT model was trained on a dataset consisting of 13,000 demonstrations of instruction following behavior. Inspired by this, we set out to see if we could achieve a similar result with Databricks employees leading the charge.

    Turns out, generating 13k questions and answers is harder than it looks. Every answer has to be original. It can’t be copied from ChatGPT or anywhere on the web, or it would taint our dataset. It seemed daunting, but Databricks has over 5,000 employees who are very interested in LLMs. So we thought we could crowdsource among them to create an even higher quality dataset than the 40 labelers had created for OpenAI. But we knew they were all busy and had full-time jobs, so we needed to incentivize them to do this.

    We set up a contest, where the top 20 labelers would get a big award. We also outlined 7 very specific tasks:

    1. Open Q&A: For instance, “Why do people like comedy movies?” or “What is the capital of France?” In some cases, there’s not a correct answer, and in others, it requires drawing on knowledge of the world at large.
    2. Closed Q&A: These are questions that can be answered using only the information contained in a passage of reference text. For instance, given a paragraph from Wikipedia on the atom, one might ask, “What is the ratio between protons and neutrons in the nucleus?”
    3. Extract information from Wikipedia: Here an annotator would copy a paragraph from Wikipedia and extract entities or other factual information such as weights or measurements from the passage.
    4. Summarize information from Wikipedia: For this, annotators provided a passage from Wikipedia and were asked to distill it to a short summary.
    5. Brainstorming: This task asked for open-ended ideation and an associated list of possible options. For instance, “What are some fun activities I can do with my friends this weekend?”
    6. Classification: For this task, annotators were asked to make judgments about class membership (e.g. are the items in a list animals, minerals or vegetables) or to judge the properties of a short passage of text, such as the sentiment of a movie review.
    7. Creative writing: This task would include things like writing a poem or a love letter.

  • Example of Open QA in databricks-dolly-15k

    How do I build a campfire?

    Safety should always come first when starting a campfire. Ensure you have water or another way to extinguish your fire readily available. Use a campfire ring, pit, or clear a large area around your campfire location. Also, make sure campfires are permitted in your area.

    With safety out of the way, all a fire needs are heat, fuel, and oxygen. Any combustible material will start a fire, but good campfire fuel sources include tinder, kindling, and dry logs.

    Use stuff like dry pine needles, dry leaves, grasses, and wood shavings for tinder. You could also use newspaper (or any paper), dryer lint, or cardboard. For kindling, use smaller pieces of wood, such as twigs and sticks no thicker than 1 inch in diameter (about the thickness of your thumb).

    Dry or well-seasoned logs will ensure a good burn. Stack the wood in a way that guarantees oxygen can easily reach your flame. Many prefer a teepee or log cabin stacking design. Now, use a heat source to create a flame or spark and apply this to your tinder. Once the tinder combusts, it should light the kindling, and the kindling should eventually ignite your logs. If all goes well, you should now be enjoying your campfire.

  • Our journey to create a commercially viable model

    We also wanted to produce an open source model that can be commercially used. Despite databricks-dolly-15k being substantially smaller than Alpaca, the dataset on which Dolly 1.0 was trained, the resulting Dolly 2.0 model, based on EleutherAI’s pythia-12b, exhibited high-quality instruction following behavior. In hindsight, this isn’t surprising. Many of the instruction tuning datasets released in recent months contain synthesized data, which often contains hallucinations and factual errors.

  • Truly open large language models

    We’ve heard repeatedly from our customers that they would be best served by owning their models, allowing them to create higher quality models for their domain specific applications without handing their sensitive data over to third parties.

    We also believe that the important issues of bias, accountability and AI safety should be addressed by a broad community of diverse stakeholders rather than just a few large companies. Open-sourced datasets and models encourage commentary, research and innovation that will help to ensure everyone benefits from advances in artificial intelligence technology.

    As a technical and research artifact, we don’t expect Dolly to be state-of-the-art in terms of effectiveness. However, we do expect Dolly and the open source dataset will act as the seed for a multitude of follow-on works, which may serve to bootstrap even more powerful language models.
