
POST | LLM Training

  • Related Project: private
  • Category: Paper Review
  • Date: 2023-05

Contents


Efficient Fine-Tuning

LoRA

Release Date: 2021.06

  • Reduces trainable parameters by 10,000 times
  • No additional inference latency
  • Performs on-par or better than full fine-tuning
Learn More >
QLoRA

Release Date: 2023.05

  • Fine-tunes a 65B-parameter model on a single 48GB GPU
  • Uses 4-bit quantization
  • Preserves full 16-bit fine-tuning performance
Learn More >
Prefix-Tuning

Release Date: 2021.01

  • Optimizes only 0.1% of parameters
  • Keeps language model parameters frozen
  • Outperforms fine-tuning in low-data settings
Learn More >
GPT Understands, Too (P-Tuning)

Release Date: 2021.03

  • Combines discrete prompts with trainable continuous prompt embeddings
  • Improves performance on a wide range of NLU tasks
  • Effective for both fully-supervised and few-shot settings
Learn More >
P-Tuning v2

Release Date: 2021.10

  • Comparable performance to full fine-tuning
  • Effective for models with 0.3B to 10B parameters
  • Tunes less than 0.1% of parameters
Learn More >
8-bit Optimizers via Block-wise Quantization

Release Date: 2021.10

  • Maintains 32-bit performance with 8-bit statistics
  • Greatly decreases computational and storage costs
  • Applicable to various tasks without hyperparameter changes
Learn More >
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Release Date: 2022.08

  • Cuts inference memory by half while retaining full precision
  • Enables immediate use of 175B parameter models after conversion
  • Makes large language models more accessible on consumer GPUs
Learn More >
Cramming: Training a Language Model on a Single GPU in One Day

Release Date: 2022.12

  • Achieves BERT-like performance with a single GPU in one day
  • Provides a modified pipeline for efficient small-scale training
  • Demonstrates that scaling laws hold even in constrained settings
Learn More >

Quantization and Efficiency

GPTQ

Release Date: 2022.10

  • One-shot weight quantization method
  • Quantizes 175B models in ~4 GPU hours
  • Enables 175B model inference on a single GPU
Learn More >
GGML

Release Date: N/A

  • Foundational C library for ML applications
  • Supports quantization for consumer hardware
  • Provides specialized binary format for LLMs
Learn More >

Attention Mechanisms

Flash Attention

Release Date: 2022.05

  • Reduces memory complexity of attention
  • Speeds up training and inference
  • Enables longer context in transformer models
Learn More >
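
FlashAttention computes exact attention in tiles so the full seq x seq score matrix is never materialized. As a rough usage sketch (assuming PyTorch 2.x, where torch.nn.functional.scaled_dot_product_attention dispatches to a fused FlashAttention-style kernel when dtype and hardware allow):

```python
import torch
import torch.nn.functional as F

# Fused attention sketch: on the fast path no (seq_len x seq_len) matrix is
# materialized, which is what enables longer contexts at the same memory budget.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```
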
ALiBi

Release Date: 2021.08

  • Attention with Linear Biases
  • Enables extrapolation to longer sequences
  • No positional embeddings required
Learn More >
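
A minimal sketch of the bias itself, as an illustration rather than the paper's code: every head adds a static linear penalty proportional to the query-key distance to the attention logits, so no positional embeddings are needed and longer sequences extrapolate more gracefully.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # head-specific slopes, the geometric schedule from the paper (2^-1 ... 2^-8 for 8 heads)
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    dist = pos[None, :] - pos[:, None]                  # (seq, seq): j - i, negative for past keys
    return slopes[:, None, None] * dist[None, :, :]     # (heads, seq, seq), added before softmax

scores = torch.randn(8, 128, 128)                       # (heads, seq, seq) attention logits
scores = scores + alibi_bias(num_heads=8, seq_len=128)  # causal masking of future positions stays separate
```
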
RoPE

Release Date: 2021.04

  • Rotary Position Embedding
  • Improves relative position modeling
  • Enables flexible sequence length adaptation
Learn More >
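
A compact, self-contained sketch of rotary embeddings (illustrative shapes, not any particular library's implementation): pairs of channels in q and k are rotated by a position-dependent angle, so their dot product depends only on the relative offset.

```python
import torch

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq_len, heads, head_dim) with even head_dim."""
    b, s, h, d = x.shape
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))   # (d/2,)
    angles = torch.arange(s).float()[:, None] * inv_freq[None, :]     # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                               # interleaved channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos[None, :, None, :] - x2 * sin[None, :, None, :]
    out[..., 1::2] = x1 * sin[None, :, None, :] + x2 * cos[None, :, None, :]
    return out

q = torch.randn(1, 16, 8, 64)
k = torch.randn(1, 16, 8, 64)
q, k = apply_rope(q), apply_rope(k)   # then compute attention as usual
```
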
Flash Attention 2

Release Date: 2023.05

  • Further improves on Flash Attention
  • Up to 9x speedup over standard attention
  • Supports all attention variants (e.g., ALiBi, RoPE)
Learn More >

Training Frameworks

State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods

Release Date: Ongoing (GitHub repository)

  • Enables efficient adaptation of pre-trained language models
  • Fine-tunes only a small number of model parameters
  • Supports multiple PEFT techniques (LoRA, Prefix Tuning, P-Tuning, etc.)
Learn More >
TRL - Transformer Reinforcement Learning

Release Date: Ongoing (GitHub repository)

  • Comprehensive toolkit for fine-tuning and aligning transformer models
  • Supports various methods including SFT, RM, PPO, and DPO
  • Offers scalable training capabilities and integration with advanced tools
Learn More >
DeepSpeed

Release Date: N/A

  • 15x speedup for training ChatGPT-like models
  • Incorporates DeepNVMe for I/O optimizations
  • Supports efficient large-scale distributed training
Learn More >
MeZO

Release Date: 2023.05

  • Fine-tuning with only forward passes
  • 12x memory reduction compared to backpropagation
  • Compatible with LoRA and prefix tuning
Learn More >

8-bit Optimizers via Block-wise Quantization

  • url: https://arxiv.org/abs/2110.02861
  • pdf: https://arxiv.org/pdf/2110.02861
  • abstract: Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states. To overcome the resulting computational, quantization, and stability challenges, we develop block-wise dynamic quantization. Block-wise quantization divides input tensors into smaller blocks that are independently quantized. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization. To maintain stability and performance, we combine block-wise quantization with two additional changes: (1) dynamic quantization, a form of non-linear optimization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient variance that comes from the highly non-uniform distribution of input tokens in language models. As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT’14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change.
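
The advertised drop-in usage amounts to swapping the optimizer class; a minimal sketch assuming the bitsandbytes package (and a CUDA setup) is available:

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024)

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # 32-bit optimizer states
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)    # block-wise 8-bit states

loss = model(torch.randn(8, 1024)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```
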

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

  • url: https://arxiv.org/abs/2208.07339
  • pdf: https://arxiv.org/pdf/2208.07339
  • abstract: Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cut the memory needed for inference by half while retaining full precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication, to quantize most of the features. However, for the emergent outliers, we also include a new mixed-precision decomposition scheme, which isolates the outlier feature dimensions into a 16-bit matrix multiplication while still more than 99.9% of values are multiplied in 8-bit. Using LLM.int8(), we show empirically it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software.
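
A toy NumPy sketch of the decomposition described above, purely illustrative rather than the library's kernels: activation columns with outlier magnitudes stay in high precision, everything else goes through a vector-wise int8 matmul, and the two partial results are summed.

```python
import numpy as np

def llm_int8_matmul(X, W, threshold=6.0):
    # X: (tokens, features) activations, W: (features, out) weights
    outlier_cols = np.abs(X).max(axis=0) > threshold          # emergent outlier features
    reg = ~outlier_cols

    # vector-wise quantization: one scale per row of X and per column of W
    sx = np.abs(X[:, reg]).max(axis=1, keepdims=True) / 127.0
    sw = np.abs(W[reg]).max(axis=0, keepdims=True) / 127.0
    Xq = np.round(X[:, reg] / sx).astype(np.int8)
    Wq = np.round(W[reg] / sw).astype(np.int8)

    int8_part = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)  # dequantize
    fp_part = X[:, outlier_cols] @ W[outlier_cols]                       # outliers stay high precision
    return int8_part + fp_part

X = np.random.randn(4, 64); X[:, 3] *= 10              # inject an outlier feature
W = np.random.randn(64, 32)
print(np.abs(llm_int8_matmul(X, W) - X @ W).max())     # small quantization error
```
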

Fine-Tuning Language Models with Just Forward Passes

  • url: https://arxiv.org/abs/2305.17333
  • pdf: https://arxiv.org/pdf/2305.17333
  • github: https://github.com/princeton-nlp/MeZO
  • abstract: Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zerothorder optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
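
A bare-bones sketch of the zeroth-order step (my own illustration, not the authors' code): the same random perturbation is regenerated from a seed, the loss is evaluated at theta + eps*z and theta - eps*z using forward passes only, and the parameters move against the estimated directional gradient, so memory stays at inference level.

```python
import torch

def mezo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6, seed=0):
    params = [p for p in model.parameters() if p.requires_grad]

    def perturb(scale):
        torch.manual_seed(seed)                 # regenerate the same z in place
        for p in params:
            p.data.add_(scale * eps * torch.randn_like(p))

    with torch.no_grad():
        perturb(+1); loss_plus = loss_fn(model, batch)    # theta + eps*z
        perturb(-2); loss_minus = loss_fn(model, batch)   # theta - eps*z
        perturb(+1)                                       # back to theta
        grad_scalar = (loss_plus - loss_minus) / (2 * eps)

        torch.manual_seed(seed)
        for p in params:                                  # theta -= lr * grad_scalar * z
            p.data.add_(-lr * grad_scalar * torch.randn_like(p))

# toy usage
lin = torch.nn.Linear(16, 1)
batch = (torch.randn(32, 16), torch.randn(32, 1))
mse = lambda m, b: torch.nn.functional.mse_loss(m(b[0]), b[1])
mezo_step(lin, mse, batch)
```
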

Cramming: Training a Language Model on a Single GPU in One Day

  • url: https://arxiv.org/abs/2212.14034
  • pdf: https://arxiv.org/pdf/2212.14034
  • abstract: Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.

Prefix-Tuning: Optimizing Continuous Prompts for Generation

  • url: https://arxiv.org/abs/2101.00190
  • pdf: https://arxiv.org/pdf/2101.00190
  • abstract: Fine-tuning large pre-trained language models on downstream tasks is parameter-inefficient. We introduce prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks. Prefix-tuning keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix). Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”. We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We find that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics that are unseen during training.
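
Conceptually, the frozen transformer attends to a handful of trainable "virtual token" key/value pairs prepended at every layer. A hypothetical PyTorch sketch of just the prefix parameters (module and shape names are my own, not the paper's code):

```python
import torch
import torch.nn as nn

class PrefixTuning(nn.Module):
    def __init__(self, n_layers, n_heads, head_dim, prefix_len=10):
        super().__init__()
        # one trainable (key, value) prefix per layer; the base LM stays frozen
        self.prefix = nn.Parameter(
            torch.randn(n_layers, 2, n_heads, prefix_len, head_dim) * 0.02
        )

    def past_key_values(self, batch_size):
        # expand to per-layer (key, value) pairs in the usual cache layout
        kv = self.prefix.unsqueeze(1).expand(-1, batch_size, -1, -1, -1, -1)
        return [(kv[i, :, 0], kv[i, :, 1]) for i in range(kv.shape[0])]

# usage idea: feed these as cached key/values so every attention layer also
# attends to the prefix, and train only the prefix parameters.
prefix = PrefixTuning(n_layers=12, n_heads=12, head_dim=64, prefix_len=10)
k0, v0 = prefix.past_key_values(batch_size=4)[0]
print(k0.shape)  # torch.Size([4, 12, 10, 64])
```
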

GPT Understands, Too (P-Tuning)

  • url: https://arxiv.org/abs/2103.10385
  • pdf: https://arxiv.org/pdf/2103.10385
  • abstract: Prompting a pretrained language model with natural language patterns has been proved effective for natural language understanding (NLU). However, our preliminary study reveals that manual discrete prompts often lead to unstable performance – e.g., changing a single word in the prompt might result in substantial performance drop. We propose a novel method P-Tuning that employs trainable continuous prompt embeddings in concatenation with discrete prompts. Empirically, P-Tuning not only stabilizes training by minimizing the gap between various discrete prompts, but also improves performance by a sizeable margin on a wide range of NLU tasks including LAMA and SuperGLUE. P-Tuning is generally effective for both frozen and tuned language models, under both the fully-supervised and few-shot settings.
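
A small illustration of the core idea (hypothetical shapes, and omitting the paper's LSTM/MLP prompt encoder): trainable continuous prompt embeddings are concatenated with the embeddings of the discrete tokens before they enter the frozen model.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 50257, 768
token_emb = nn.Embedding(vocab_size, hidden)       # stands in for the frozen LM's embedding table
token_emb.weight.requires_grad_(False)

n_prompt = 8
prompt_emb = nn.Parameter(torch.randn(n_prompt, hidden) * 0.02)   # trainable continuous prompt

input_ids = torch.randint(0, vocab_size, (4, 32))                 # discrete prompt/input tokens
inputs_embeds = torch.cat(
    [prompt_emb.unsqueeze(0).expand(4, -1, -1), token_emb(input_ids)], dim=1
)
print(inputs_embeds.shape)  # (4, 8 + 32, 768), fed to the frozen model via inputs_embeds
```
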

P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks

  • url: https://arxiv.org/abs/2110.07602
  • pdf: https://arxiv.org/pdf/2110.07602
  • abstract: Prompt tuning (Lester et al., 2021) has been shown to be a promising parameter-efficient method in adapting large language models (LLMs) for downstream tasks. However, existing research on prompt tuning has been limited to LLMs with hundreds of billions of parameters. For LLMs with parameters ranging from a few hundred million to tens of billions, prompt tuning is significantly outperformed by finetuning. In this paper, we propose P-Tuning v2, an improved prompt tuning strategy that performs comparably or even better than fine-tuning on models of various scales and tasks. We demonstrate that P-Tuning v2 outperforms fine-tuning on models with 0.3B to 10B parameters on natural language understanding (NLU) tasks. We also show that P-Tuning v2 is more parameter-efficient than fine-tuning, as it only needs to tune less than 0.1% of the parameters to achieve comparable performance. Our approach has important implications for efficient adaptation of LLMs in real-world scenarios.

LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models

  • url: https://arxiv.org/abs/2304.01933
  • pdf: https://arxiv.org/pdf/2304.01933
  • abstract: The success of large language models (LLMs), like GPT-3 and ChatGPT, has led to the development of numerous cost-effective and accessible alternatives that are created by fine-tuning open-access LLMs with task-specific data (e.g., ChatDoctor) or instruction data (e.g., Alpaca). Among the various fine-tuning methods, adapter-based parameter-efficient fine-tuning (PEFT) is undoubtedly one of the most attractive topics, as it only requires fine-tuning a few external parameters instead of the entire LLMs while achieving comparable or even better performance. To enable further research on PEFT methods of LLMs, this paper presents LLM-Adapters, an easy-to-use framework that integrates various adapters into LLMs and can execute these adapter-based PEFT methods of LLMs for different tasks. The framework includes state-of-the-art open-access LLMs such as LLaMA, BLOOM, OPT, and GPT-J, as well as widely used adapters such as Series adapter, Parallel adapter, and LoRA. The framework is designed to be research-friendly, efficient, modular, and extendable, allowing the integration of new adapters and the evaluation of them with new and larger-scale LLMs. Furthermore, to evaluate the effectiveness of adapters in LLMs-Adapters, we conduct experiments on six math reasoning datasets. The results demonstrate that using adapter-based PEFT in smaller-scale LLMs (7B) with few extra trainable parameters yields comparable, and in some cases superior, performance to that of powerful LLMs (175B) in zero-shot inference on simple math reasoning datasets. Overall, we provide a promising framework for fine-tuning large LLMs on downstream tasks. We believe the proposed LLMs-Adapters will advance adapter-based PEFT research, facilitate the deployment of research pipelines, and enable practical applications to real-world systems.
  • remark: for checking references
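
For reference, a minimal sketch of the "series" bottleneck adapter pattern that such frameworks insert into each transformer block (illustrative, not the repository's code): down-projection, nonlinearity, up-projection, and a residual connection, trained while the base model stays frozen.

```python
import torch
import torch.nn as nn

class SeriesAdapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        nn.init.zeros_(self.up.weight); nn.init.zeros_(self.up.bias)  # start as the identity

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))   # residual keeps the frozen path intact

h = torch.randn(2, 16, 768)
print(SeriesAdapter(768)(h).shape)   # torch.Size([2, 16, 768])
```
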

LoRA: Low-Rank Adaptation of Large Language Models

  • url: https://arxiv.org/abs/2106.09685
  • pdf: https://arxiv.org/pdf/2106.09685
  • github: https://github.com/microsoft/LoRA
  • abstract: An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example - deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at this https URL.
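
A minimal PyTorch sketch of the idea (not the released package): the pretrained weight stays frozen and the update is factored as B·A with a small rank r, so only A and B are trained; B·A can later be merged into W, which is why there is no extra inference latency.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)    # trainable
        self.B = nn.Parameter(torch.zeros(out_features, r))          # trainable, starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        # frozen path plus low-rank update
        return x @ self.weight.T + self.scaling * (x @ self.A.T) @ self.B.T

layer = LoRALinear(1024, 1024, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16,384 trainable vs. 1,048,576 frozen weights
```
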

QLoRA: Efficient Finetuning of Quantized LLMs

  • url: https://arxiv.org/abs/2305.14314
  • pdf: https://arxiv.org/pdf/2305.14314
  • year: 2023
  • abstract: We present QLoRA, an efficient fine-tuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of fine-tuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) Double Quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) Paged Optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular fine-tuning (e.g. 33B and 65B parameter models). Our results show that QLoRA fine-tuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. We release all of our models and code, including CUDA kernels for 4-bit training.
  • paperswithcode: https://paperswithcode.com/paper/qlora-efficient-fine-tuning-of-quantized-llms
  • github: https://github.com/artidoro/qlora
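
A hedged usage sketch of the QLoRA recipe on the Hugging Face stack (transformers + peft + bitsandbytes); the model id is a placeholder and keyword names may shift between library versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,           # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                    # placeholder model id
    quantization_config=bnb_config, device_map="auto",
)

lora_config = LoraConfig(r=64, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)    # gradients flow only through the LoRA adapters
model.print_trainable_parameters()
```
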

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

  • url: https://arxiv.org/abs/2210.17323
  • pdf: https://arxiv.org/pdf/2210.17323
  • github: https://github.com/IST-DASLab/gptq
  • abstract: Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute an 175 billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at this https URL.
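
A simplified NumPy sketch of the per-column loop at the heart of the method, without the blocking and Cholesky tricks of the real implementation: each weight column is rounded to the grid and the resulting error is pushed onto the not-yet-quantized columns through the inverse Hessian built from a few calibration activations.

```python
import numpy as np

def gptq_like_quantize(W, X, bits=4, damp=0.01):
    # W: (out_features, in_features) layer weights, X: (in_features, n_samples) calibration inputs
    out_f, in_f = W.shape
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(in_f)        # dampening for invertibility
    Hinv = np.linalg.inv(H)

    W = W.copy()
    Q = np.zeros_like(W)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax                   # per-row scale, kept simple here

    for j in range(in_f):
        w = W[:, j]
        q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        err = (w - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])     # compensate remaining columns
    return Q

W = np.random.randn(32, 64)
X = np.random.randn(64, 128)
Q = gptq_like_quantize(W, X)
print(np.linalg.norm((W - Q) @ X) / np.linalg.norm(W @ X))  # relative output error
```
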

GGML

  • github: https://github.com/ggerganov/ggml
  • abstract: GGML, introduced by Georgi Gerganov, is a foundational C library that facilitates machine learning applications by providing low-level primitives, a specialized binary format for large language models (LLMs), and Rust bindings for safe, idiomatic access. Central to GGML is the technique of quantization, which optimizes LLMs for consumer hardware by reducing the precision of model weights, thus making advanced computational models more accessible. GGML supports dynamic evolution through its versioning system which integrates enhancements like vocabulary scoring and memory-mapping to boost performance without compromising backward compatibility. This system meticulously organizes LLM components—hyperparameters, vocabulary, and weights—into a structured binary format. Hyperparameters configure model behaviors, vocabularies consist of tokens that aggregate to form language, and weights are organized in layers within tensors to define the model’s architecture. Each version iteration aims to refine these elements to enhance functionality and efficiency, demonstrating GGML’s commitment to democratizing cutting-edge machine learning technologies.

State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods

  • github: https://github.com/huggingface/peft

TRL - Transformer Reinforcement Learning

  • github: https://github.com/huggingface/trl
  • abstract: The TRL (Transformer Reinforcement Learning) library represents a comprehensive full-stack toolkit designed to fine-tune and align transformer-based language and diffusion models. Built atop the widely-used transformers library, TRL leverages a variety of advanced methodologies including Supervised Fine-Tuning (SFT), Reward Modeling (RM), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO) to enhance model performance. Key features include scalable model training capabilities across devices from single GPUs to multi-node clusters, enabled by technologies such as DDP, DeepSpeed, and quantization methods like LoRA and QLoRA. Furthermore, TRL integrates the unsloth tool for expedited training and offers CLI capabilities for user-friendly, code-free interaction with models. The library supports a range of fine-tuning classes and introduces value head extensions to models for reinforcement learning applications. TRL’s usability is highlighted through practical examples such as fine-tuning GPT models for positivity in movie reviews or reducing toxicity, providing users with flexible, powerful tools for developing customized, aligned AI models.
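
As one concrete example of what the library optimizes, here is a compact sketch of the DPO objective with stand-in log-probability tensors (my own illustration, not TRL's code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # implicit rewards: policy log-probs measured relative to the frozen reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# toy example: the policy already prefers the chosen responses slightly
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.0, -10.5]))
print(loss)
```
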

Microsoft DeepSpeed

  • github: https://github.com/microsoft/DeepSpeed
  • abstract: In the realm of deep learning (DL) optimization, the innovative advancements introduced by DeepSpeed have set new benchmarks for training and deploying large language models (LLMs) such as ChatGPT. Notably, DeepSpeed’s latest development enables the training of ChatGPT-like models with a remarkable 15x speedup compared to current state-of-the-art RLHF systems. This enhancement is achieved through a single-click deployment that drastically reduces costs across various scales of operation. DeepSpeed’s suite incorporates cutting-edge technologies such as DeepNVMe for I/O optimizations, and advanced checkpointing methods that support efficient and flexible operations during large-scale distributed training. Additionally, the integration of DeepSpeed with Windows platforms and its support for various hardware configurations underline its adaptability and broad application potential. By providing scalable solutions that facilitate the rapid and cost-effective training of LLMs, DeepSpeed is poised to accelerate the adoption and innovation of AI capabilities across diverse sectors.
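
A hedged sketch of wiring a model into DeepSpeed with a ZeRO stage-2 config; the keys follow the documented JSON schema, but the values are placeholders rather than a tuned setup:

```python
import torch
import deepspeed

ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},            # shard optimizer states and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model = torch.nn.Linear(1024, 1024)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# training then goes through engine(batch), engine.backward(loss), engine.step()
```
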