
Model | Open Assistant

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-08-03

OpenAssistant Conversations - Democratizing Large Language Model Alignment

  • url: https://arxiv.org/abs/2304.07327
  • pdf: https://arxiv.org/pdf/2304.07327
  • abstract: Aligning large language models (LLMs) with human preferences has proven to drastically improve usability and has driven rapid adoption as demonstrated by ChatGPT. Alignment techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) greatly reduce the required skill and domain knowledge to effectively harness the capabilities of LLMs, increasing their accessibility and utility across various domains. However, state-of-the-art alignment techniques like RLHF rely on high-quality human feedback data, which is expensive to create and often remains proprietary. In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations, a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers. To demonstrate the OpenAssistant Conversations dataset’s effectiveness, we present OpenAssistant, the first fully open-source large-scale instruction-tuned model to be trained on human data. A preference study revealed that OpenAssistant replies are comparably preferred to GPT-3.5-turbo (ChatGPT) with a relative winrate of 48.3% vs. 51.7% respectively. We release our code and data under fully permissive licenses.


TL;DR


  • Reconciling AI model alignment with human values
  • Presents an efficient way to train language models using a large-scale conversation dataset
  • Explores a mathematical approach to optimizing the prediction of human responses with machine learning

  • Aligning large language models (LLMs) with human preferences improves usability, as demonstrated by the rapid adoption of ChatGPT; alignment techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) reduce the skill and domain knowledge needed to harness LLM capabilities effectively, increasing their accessibility and utility across many domains.
  • State-of-the-art alignment techniques such as RLHF depend on high-quality human feedback data, which is expensive to produce and often kept proprietary; to democratize research on large-scale alignment, the authors release OpenAssistant Conversations, a human-generated, human-annotated assistant-style conversation corpus.
    • The corpus spans 35 languages and consists of 161,443 messages across 66,497 conversation trees with 461,292 quality ratings.

1 Introduction

Artificial intelligence, and natural language processing in particular, has advanced rapidly in recent years. Most of this progress has been driven by a simple formula: use a plain transformer-based architecture, increase the parameter count by scaling model depth and width, and greatly expand the training dataset. Although models have long shown strong fit to the training data and good generalization, their use among the general public has only recently begun to grow, largely because of a mismatch between a model's predictions and its intended end use. Alignment is the process of ensuring that an AI system not only optimizes its training objective successfully but also produces predictions that match the intended purpose and adhere to the ethical and safety standards provided by humans. One recently proposed way to address this is assistant-style fine-tuning, which brings large language models more in line with human preferences.


2 Data Format

The basic structure of the dataset is the Conversation Tree (CT), whose nodes represent messages in a conversation. The root node of a CT is the initial prompt supplied by a prompter. Every tree node is labeled with its role, and each node can have multiple children of the opposite role. A path from the root to any node is a valid conversation in which the prompter and the assistant take turns.


3 Data Collection

The Open Assistant Conversations dataset is a comprehensive collection of conversational data gathered through crowdsourcing with more than 13,000 volunteers. Collection was carried out through a web-app interface that split the work into five steps: writing prompts, labeling prompts, adding replies as prompter or assistant, labeling replies, and ranking assistant replies. Content moderation and spam filtering are core components of the pipeline, maintaining high quality and safety standards.

3.1 Single-Step Collection

Data collection is structured efficiently and effectively by breaking the work into single units. This approach minimizes data loss due to user attrition and ensures that every completed unit of work is captured and used.

3.3 Ranking Merging

Reinforcement learning from human feedback (RLHF) is a technique for optimizing the output distribution of a language model using a preference structure provided by human rankers. This requires merging the opinions of K independent rankers over the possible responses, which is treated as a ranked-voting problem known as "ranked pairs" or "Tideman's method". The method produces a sorted list of "winners" according to the strength of preference of one element over the others.

Tideman's method, devised by Nicolaus Tideman in 1987, is used to determine the outcome of a ranked vote. It aims to find the Condorcet winner, i.e. the candidate who beats every other candidate in head-to-head comparisons, and it can still produce a best ranking even when no such winner exists, which is why it is often used when the choice is ambiguous.

  • Typical order of steps in Tideman's method
    • Pairwise comparisons > determine wins and losses > sort pairs by strength of victory > lock the pairs into a graph > determine the final ranking

Likert scales and Tideman's method are both commonly used when labeling RLHF or other human-preference (feedback) data.


4 Dataset Composition

The full dataset consists of 161,443 messages across 66,497 conversation trees, annotated in 35 languages. Likert-scale human labels are collected for categories such as creativity, quality, humor, helpfulness, violence, and rudeness. In addition, each assistant message is ranked against the other assistant messages submitted for the same prompt.


1 Introduction

Artificial intelligence (AI), particularly in the field of natural language processing, has witnessed rapid progress in recent years. Major advancements are primarily driven by a straightforward formula: Take a simple transformer-based architecture, increase the number of parameters by enlarging the specified depth and width, and finally, significantly scale the training corpus. Although models have for some time exhibited an extraordinary, super-human ability to fit the training data and generalize based on their trained objective [1,2], their adoption among the general public has until recently been slow. This can be mainly attributed to misalignment between the model’s predictions and the final intended usage. The alignment of AI systems with human values, intentions, and preferences is a vital and intricate challenge within the AI research domain. This refers to the process of ensuring that AI systems can not only successfully optimize surrogate provided training objectives, but also that their predictions are in line with their intended purpose and adhere to ethical and safety standards provided by humans. One possible solution is assistant-style fine-tuning of language models that has recently emerged as a promising approach to making large language models more in line with human preferences by generating more desirable outputs [3] and thus making them more useful.

A notable instance of such an assistant-style model is ChatGPT, which has gained unprecedented user growth due to remarkable capabilities demonstrated in a wide range of fields, but also the ease-of-use for the end user. Aligning the model’s predictions is in this case accomplished by introducing human-generated examples of intended usage and using reinforcement learning from human feedback (RLHF) [4,5]. In RLHF, the human acts as a teacher and provides feedback in the form of rewards or penalties. In more detail, Ouyang et al. [4] proposed a three-stage procedure to align language models:

  • Collect human-generated demonstrations of desired behavior and train a supervised fine-tuned (SFT) model.
  • Train a reward model (RM) on human-annotated rankings for different model outputs.
  • Use the RM as a reward function and fine-tune the SFT model to maximize the reward generated by its responses. This is achieved using the PPO algorithm [6].
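
To make the second stage concrete, below is a minimal sketch of the pairwise (Bradley-Terry style) ranking loss commonly used to train a reward model on annotated preference pairs. The `TinyRewardModel` class, the pooled-embedding input, and the hidden size are illustrative stand-ins, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Illustrative reward head: maps a pooled response embedding to a scalar reward.
    In practice the backbone is a full language model; a linear layer stands in here."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(pooled_embedding).squeeze(-1)

def pairwise_ranking_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: pushes the reward of the preferred reply above the other."""
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with random "embeddings" standing in for encoded conversation threads.
model = TinyRewardModel()
chosen = model(torch.randn(4, 768))    # batch of human-preferred replies
rejected = model(torch.randn(4, 768))  # batch of dispreferred replies
loss = pairwise_ranking_loss(chosen, rejected)
loss.backward()
```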

It becomes apparent that the benefits of all the aforementioned stages are predominantly determined by the quality of the data used [7]. Despite this, availability of large-scale human feedback datasets for the open research community remains scarce. Most openly accessible datasets are comprised of synthetic data of instructions automatically generated by querying language models [8, 9, 10, 11, 12]. Unfortunately, these datasets are limited with respect to their complexity, creativity, and quality, as they rely on a pre-specified list of possible instruction types. Without sufficiently broad and high-quality data, even models with substantial size and pre-training would prove inadequate for building capable, helpful, and harmless AI assistants. Research in this area has predominantly been confined to a select few research labs with access to the required resources to engage in large-scale training and data collection. This monopolization of access to quality data undermines the potential for inclusive and diverse research endeavors, particularly in relation to alignment challenges, which arguably constitute some of the most crucial research areas of our time.

In an effort to democratize research on aligning large language models, we introduce and release the Open Assistant Conversations dataset. This dataset is the culmination of an extensive open- and crowd-sourcing initiative, and its release to the research community seeks to promote more inclusive research in this highly influential domain. We provide a comprehensive analysis of the dataset, assessing ethical implications and safety considerations. We also fine-tune and release several assistant and preference models to further advance open access and research in this area. This transparency allows for iterative improvements on the released artifacts, fostering a more collaborative and inclusive research environment. Our belief is that our work makes a noteworthy contribution towards creating a research landscape that is more inclusive and democratized, thereby providing opportunities to researchers from diverse backgrounds. In the following sections, we delve into the intricacies of the Open Assistant Conversations dataset and discuss its implications for the alignment of large language models and for society at large.

2 Data Format

The basic data structure is a Conversation Tree (CT) with nodes representing written messages in a conversation. A CT's root node represents an initial prompt, given by the prompter. To avoid confusion, we call the roles of the conversation prompter and assistant. This allows us to reserve the term user for the human contributors. Both the prompter and assistant roles can be fulfilled by either a human user or a machine. Every tree node is labeled by its role and can have multiple children of the opposite role, each of which represents a separate next step in the conversation. A path from the root to any node in the tree (including to itself) represents a valid conversation with prompter and assistant taking turns and is called a thread. Tree nodes are annotated with additional data such as user-provided labels and metadata, such as collection timestamp and indicated language. Each assistant node further has a rank associated which orders it compared to replies of the parent prompt, according to user preferences.
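
As a rough illustration of this structure, a conversation tree and its threads could be modeled as in the sketch below. Field names such as `labels` and `rank` are simplified stand-ins, not the exact released schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MessageNode:
    """One node of a Conversation Tree (CT): a single prompter or assistant message."""
    text: str
    role: str                                      # "prompter" or "assistant"
    labels: dict = field(default_factory=dict)     # e.g. {"quality": 0.9, "spam": 0.0}
    rank: Optional[int] = None                     # order among sibling assistant replies
    lang: Optional[str] = None
    children: List["MessageNode"] = field(default_factory=list)

    def add_reply(self, reply: "MessageNode") -> "MessageNode":
        # Children must alternate roles: assistant replies under prompter messages and vice versa.
        assert reply.role != self.role, "parent and child must have opposite roles"
        self.children.append(reply)
        return reply

def threads(root: MessageNode) -> List[List[MessageNode]]:
    """Enumerate all threads: root-to-node paths, each a valid alternating conversation."""
    out, stack = [], [(root, [root])]
    while stack:
        node, path = stack.pop()
        out.append(path)
        for child in node.children:
            stack.append((child, path + [child]))
    return out

# Minimal example tree: one prompt with two competing assistant replies.
root = MessageNode("How do I sort a list in Python?", role="prompter", lang="en")
a1 = root.add_reply(MessageNode("Use sorted(my_list) ...", role="assistant", rank=0))
a2 = root.add_reply(MessageNode("Call my_list.sort() ...", role="assistant", rank=1))
print(len(threads(root)))  # 3 threads: [root], [root, a1], [root, a2]
```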

3 Data Collection

The Open Assistant Conversations dataset is a comprehensive collection of conversational data that was obtained through a crowdsourcing effort involving more than 13,000 volunteers. The data was collected using a web-app interface, which facilitated the process by dividing it into five separate steps: prompting, labeling prompts, adding reply messages as prompter or assistant, labeling replies, and ranking assistant replies. The dataset was curated with content moderation and spam filtering as key components of the annotation pipeline, ensuring high quality and safety standards.

Volunteers completed over 625,000 tasks in total, resulting in the collection of over 10,000 fully annotated and filtered Conversation Trees. We hope the resulting dataset will be an important resource for researchers studying natural language processing and machine learning, as it allows for the development and testing of new algorithms and models for conversational AI. By providing such a large and diverse dataset, the Open Assistant Conversations dataset opens up new avenues of research in the field, enabling researchers to explore the complexities of human language and interactions in ways that were not possible before.

Example User Interface (UI) displays of the data collection platform can be found in Appendix B. In the following sections, we provide more details regarding the various aspects of the data collection pipeline.

3.1 Single-Step Collection

The process of data collection in this study is structured to be both efficient and effective by breaking the work down into single units and advancing multiple conversation trees one step at a time. This approach minimizes data loss due to user attrition and ensures that every unit of work is captured for utilization.

The users are presented with a range of task types, either by choice or through random sampling (weighted according to current requirements). The task types include creating prompts, replying as an assistant, replying as a prompter, labeling prompts or replies, and ranking prompter or assistant replies.

  • Create a prompt: Users are required to write an initial prompt that forms the root of a new conversation tree. A lottery system is employed to manage the selection of new prompts, with only a fixed number of prompts being chosen for continuation at any given moment.
  • Reply as an assistant: Replying as an assistant is a more labor-intensive task that necessitates users to carefully consider their responses and often engage in external research to provide a helpful and relevant answer to the prompter’s request. A reward system has been implemented to incentivize users to participate.
  • Reply as a prompter: The task of replying as a prompter does not impose strict quality requirements but instead emphasizes the importance of diversity to accommodate various use-cases.
  • Label a prompt or reply: Users are presented with a message from the database along with the preceding conversation thread (if available) and are asked to categorize the message according to three dimensions: spam detection, guideline adherence, and quality.
  • Rank assistant replies: Users are presented with two or more responses to the same parent message and asked to rank them in order of preference.

In summary, this data collection methodology effectively divides work into single units, minimizes data loss due to user attrition, and captures valuable information for future analysis and application. By offering users a diverse range of task types, the study encourages active participation and ensures the collection of rich and varied data for a comprehensive understanding of the subject.
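
For intuition, the weighted random sampling of task types mentioned above could look like the following sketch. The task names and weights are hypothetical placeholder values, not the platform's real configuration.

```python
import random

# Hypothetical task weights reflecting current platform needs (illustrative values only).
TASK_WEIGHTS = {
    "create_prompt": 1.0,
    "reply_as_assistant": 4.0,   # heavier weight where more effort/data is needed
    "reply_as_prompter": 2.0,
    "label_message": 3.0,
    "rank_assistant_replies": 2.0,
}

def sample_task(weights: dict = TASK_WEIGHTS) -> str:
    """Pick the next task type for a user, weighted by current requirements."""
    tasks, w = zip(*weights.items())
    return random.choices(tasks, weights=w, k=1)[0]

print(sample_task())
```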

3.2 Message Tree State Machine

The tree state machine serves as a systematic approach to managing the progression of message trees throughout the data collection process. This method ensures that each tree undergoes a series of states until it reaches completion, beginning with the creation of new trees by randomly sampling from the pool of initial prompts.

The various states that a message tree passes through include the initial prompt review state, growing state, and end state, as well as the aborted low-grade state for trees that are deemed unsuitable for inclusion in the dataset.

Upon the creation of a new tree, it enters the initial prompt review state, where multiple users are tasked with providing labels to assess its quality and suitability. This state plays a crucial role in identifying any potential issues with the initial prompt, such as spam or content that violates the established guidelines. If the provided labels indicate that the tree contains spam or unsuitable content, it is transitioned to the aborted low-grade state and subsequently removed from the dataset. Conversely, if the tree passes the initial prompt review state, it proceeds to the growing state.

The growing state involves the continuous issuance of tasks to users, such as providing replies, labels, and rankings, to facilitate the development and expansion of the conversation tree. This state is essential for collecting diverse and rich data, as it allows for the accumulation of multiple interactions and the exploration of various conversation paths, given the same initial prompt. The growing state continues until the tree reaches its end state, which is defined by a maximum number of messages or other predetermined criteria. Parameters within the data collection platform govern the behavior of the tree state machine, such as the average number of messages collected for each parent message or the maximum tree depth. These parameters enable researchers to fine-tune the data collection process according to their specific research goals and requirements, ensuring a more targeted and efficient approach to gathering data. Parameters varied during the collection of the dataset; current settings can be found in Appendix F.

In summary, the tree state machine is a structured and systematic method for managing the progression of message trees during the data collection process. By guiding each tree through a series of states, from initial prompt review to growing and reaching its end state, the tree state machine ensures the collection of high-quality, diverse, and relevant data. Additionally, the inclusion of platform parameters allows for the customization of the data collection process to align with specific research objectives, further enhancing the effectiveness and utility of this approach.
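
A minimal sketch of such a state machine is shown below; the states follow the description above, while the thresholds are invented placeholders (the real parameters are listed in Appendix F of the paper).

```python
from enum import Enum, auto

class TreeState(Enum):
    INITIAL_PROMPT_REVIEW = auto()
    GROWING = auto()
    END = auto()
    ABORTED_LOW_GRADE = auto()

# Illustrative thresholds only; the real platform parameters are in Appendix F.
MAX_MESSAGES = 12
SPAM_THRESHOLD = 0.5

def advance(state: TreeState, spam_score: float, num_messages: int) -> TreeState:
    """Apply one transition of the message tree state machine described above."""
    if state is TreeState.INITIAL_PROMPT_REVIEW:
        # Initial prompt review: abort spammy trees, otherwise start growing.
        return TreeState.ABORTED_LOW_GRADE if spam_score >= SPAM_THRESHOLD else TreeState.GROWING
    if state is TreeState.GROWING:
        # Keep issuing reply/label/rank tasks until the tree hits its size limit.
        return TreeState.END if num_messages >= MAX_MESSAGES else TreeState.GROWING
    return state  # END and ABORTED_LOW_GRADE are terminal

print(advance(TreeState.INITIAL_PROMPT_REVIEW, spam_score=0.1, num_messages=1))  # TreeState.GROWING
```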

3.3 Ranking Merging

Reinforcement learning from human feedback (RLHF) comprises a set of techniques that all aim to optimize the output distribution of a language model using the preference structure provided by human rankers. To get a preference structure that is well aligned to users, we cannot just rely on the opinions of individual rankers, due to the high variance in human preferences. Since our objective is to collect data for a generally capable digital assistant, every ranking of possible responses is performed by K independent rankers.

Once this is done, we need to fuse these K individual opinions into one consensus opinion usable in training preference models. We perform this preference fusion by treating it as a ranked-voting problem, whose objective is to maintain the preferences as faithfully as possible. The method chosen for this is known as “ranked pairs” or “Tideman’s method.” Simplified, this method creates a sorted list of “winners” according to the strength of the preference of one element over the others.
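
A self-contained sketch of ranked pairs applied to K individual rankings is given below. It follows the textbook algorithm (pairwise margins, sorting pairs by strength of victory, locking pairs that do not create a cycle, then reading the final order off the locked graph) and is not the project's exact implementation.

```python
from itertools import combinations

def ranked_pairs(rankings):
    """Fuse several individual rankings into one consensus order (Tideman's ranked pairs).

    `rankings` is a list of lists, each an ordering of the same candidates from best to
    worst, e.g. the K independent rankings of sibling assistant replies.
    """
    candidates = set(rankings[0])
    # 1. Pairwise comparisons: margin[(a, b)] = number of rankers preferring a over b.
    margin = {(a, b): 0 for a in candidates for b in candidates if a != b}
    for ranking in rankings:
        pos = {c: i for i, c in enumerate(ranking)}
        for a, b in combinations(candidates, 2):
            if pos[a] < pos[b]:
                margin[(a, b)] += 1
            else:
                margin[(b, a)] += 1
    # 2.-3. Keep winning pairs and sort them by strength of victory.
    pairs = [(a, b) for (a, b) in margin if margin[(a, b)] > margin[(b, a)]]
    pairs.sort(key=lambda p: margin[p] - margin[(p[1], p[0])], reverse=True)
    # 4. Lock pairs into a directed graph, skipping any edge that would create a cycle.
    locked = {c: set() for c in candidates}

    def reaches(src, dst):
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            seen.add(node)
            stack.extend(locked[node] - seen)
        return False

    for a, b in pairs:
        if not reaches(b, a):      # adding a -> b must not close a cycle
            locked[a].add(b)
    # 5. Final ranking: repeatedly pick the candidate nobody remaining still beats.
    order, remaining = [], set(candidates)
    while remaining:
        source = next(c for c in remaining
                      if not any(c in locked[o] for o in remaining if o != c))
        order.append(source)
        remaining.remove(source)
    return order

# Three rankers ordering the same three assistant replies.
print(ranked_pairs([["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]))  # ['A', 'B', 'C']
```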

3.4 Contributor Guidelines

To achieve a high degree of quality and consistency across a wide range of contributors, we issue clear and detailed guidelines. A full copy of these guidelines at the present time can be found in Appendix A.

Our guidelines follow three main goals:

  1. Clarify the meanings, scales, and criteria for assigning labels and rankings during the labeling and ranking tasks.
  2. Make assistant responses polite, helpful, concise, friendly, and safety-aware.
  3. Instruct prompts and prompter replies to explore a diverse and challenging set of inputs to the assistant role.

3.5 Quality Control & Content Moderation

We take a multi-pronged approach to quality assurance, with the main pillars being a system of reward points & leaderboards, and manual review of flagged content by human moderators. This maximizes the quality of contributions while making effective use of the limited time of the volunteer moderators.

In an effort to demonstrate progress and achievement to users, and to encourage high-quality contributions, our system allocates points for the completion of tasks. These points contribute to various leaderboards, including daily, weekly, monthly, and all-time rankings. A level system also exists, wherein higher point accumulation results in elevated levels, reflecting veteran status and engagement. In the future, this system could potentially be developed further to facilitate preferential access to more engaging tasks or similar perks.

The distribution of points is contingent upon task type, as certain tasks require greater effort, such as the reply as assistant task (compared to the create a prompt task). A significant portion of points is deferred and reliant on interactions with other users. For instance, a user’s assistant reply may gather many additional points if it is subsequently deemed non-spam and highly ranked by other users. Inversely, points may be reduced or lost for answers that are labeled as spam or down-voted by consensus of other users.

Within the moderator section of the website, an alternative leaderboard, designated the Trollboard, is exhibited. This leaderboard assesses users based on an aggregate of negative labels, reports, and down-votes received for their contributions. This approach enables human moderators to proactively scrutinize potentially misbehaving users in a comprehensive manner. The Trollboard has proven to be an effective tool in addressing the numerical disparity between users and moderators, maximizing the collective efforts of contributors to identify undesirable contributions.

Users further have the option to report messages to moderators for manual review, either via the platform or directly via communication on a community chat server. Moderators have the ability to delete individual messages or all messages of a given user at their own discretion. Deleted messages are retained but marked as deleted and not exported for training.

4 Dataset Composition

We release several variants of the Open Assistant Conversations dataset representing various levels of filtering. The full dataset consists of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. This includes 8,576 synthetic messages, leaving 152,867 human-submitted messages. Of the 66,497 total conversation trees, we consider 10,968 complete, meaning the full number of messages has been collected and the moderation process for these trees has been concluded. These completed trees contain 92,365 messages.

The set of categories for which Likert-scale human labels are collected is Creativity, Quality, Humor, Helpfulness, Violence, and Rudeness. The set of categories for which binary human labels are collected is Language Mismatch, Not Appropriate, Personally Identifiable Information, Hate Speech, and Sexual Content. We additionally release the rank of each assistant message compared to other assistant messages submitted for the same prompt, computed from the preference rankings of several human annotators.

Of the 161,443 total messages, 69,614 are assistant replies, and 91,829 are user prompts. Related to this, 52,159 conversation trees consist of only a single initial user prompt which has not yet received any assistant replies. The dataset is dominated by English and Spanish messages as illustrated in Figure 2. The prominence of English is expected as a result of the community around Open Assistant originating in the English-speaking open-source machine learning community. The high quantity of Spanish messages can be attributed to the publicity given to Open Assistant by prominent figures in the Spanish machine learning community.
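
For readers who want to inspect the released corpus themselves, the hedged sketch below uses the Hugging Face `datasets` library. The dataset identifier `OpenAssistant/oasst1` and field names such as `role`, `lang`, and `rank` are assumptions based on the public export and may differ from the exact schema described here.

```python
from collections import Counter

from datasets import load_dataset

# Assumed public export of the corpus on the Hugging Face Hub.
ds = load_dataset("OpenAssistant/oasst1", split="train")

# Language and role distribution of the exported messages.
print(Counter(ds["lang"]).most_common(5))
print(Counter(ds["role"]))

# Top-ranked assistant replies only (rank 0 = best among sibling replies, per the export).
top_replies = ds.filter(lambda m: m["role"] == "assistant" and m["rank"] == 0)
print(len(top_replies))
```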

