url: https://arxiv.org/abs/2407.08296
pdf: https://arxiv.org/pdf/2407.08296
html: https://arxiv.org/html/2407.08296v1
abstract: Training Large Language Models (LLMs) is memory-intensive due to the large number of parameters and associated optimization states. GaLore, a recent method, reduces memory usage by projecting weight gradients into a low-rank subspace without compromising performance. However, GaLore relies on time-consuming Singular Value Decomposition (SVD) operations to identify the subspace, and the frequent subspace updates lead to significant training time overhead. Moreover, GaLore offers minimal improvements in accuracy and efficiency compared to LoRA in more accessible fine-tuning scenarios. To address these limitations, we introduce Q-GaLore, a novel approach that substantially reduces memory usage by combining quantization and low-rank projection, surpassing the benefits of GaLore. Our method is based on two key observations: (i) the gradient subspace exhibits diverse properties, with some layers converging early in training while others are subject to frequent changes; (ii) the projection matrices are highly resilient to low-bit quantization. Leveraging these insights, Q-GaLore adaptively updates the gradient subspace based on its convergence statistics, achieving comparable performance while significantly reducing the number of SVD operations. We maintain the projection matrices in INT4 format and weights in INT8 format, incorporating stochastic rounding to capture accumulated gradient information. This approach enables a high-precision training trajectory using only low-precision weights. We demonstrate that Q-GaLore achieves highly competitive performance with exceptional memory efficiency. At pre-training, Q-GaLore facilitates training a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB memory. At fine-tuning, it reduces memory consumption by up to 50% compared to LoRA and GaLore, while consistently outperforming QLoRA at the same memory cost.
[Index tag: LoRA applications]
Contents
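1. Introduction
Recent work has shown that Large Language Models (LLMs) deliver strong performance across many disciplines. However, training and fully fine-tuning models with billions of parameters is difficult because it demands substantial computing resources.
For example, Meta's LLaMA models were developed on 2048 A100-80GB GPUs over roughly five months, and pre-training a LLaMA 7B model from scratch with a single batch requires at least 58 GB of memory.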
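2. Related Work
2.1 Low-Rank Adaptation and Training
Optimizing LLMs requires a substantial memory footprint to hold weights, activations, gradients, and optimizer states. Techniques such as LoRA reduce this footprint by introducing low-rank weight adapters for each layer. Subsequent improvements have focused mainly on fine-tuning scenarios. GaLore exploits the low-rank structure of gradients to enable full-parameter learning while greatly reducing memory usage during optimization.
2.2 Low-Precision Training
Low-precision training aims to improve training efficiency by storing data in low-precision formats and using low-precision General Matrix Multiplication (GEMM) operations. Its main challenge is potential instability during training.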
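3. Method
Quantization can be classified into Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). In Q-GaLore, model weights are kept in INT8, while activations and gradients are computed in BFloat16. Q-GaLore dynamically adjusts the frequency of SVD operations to save computation time. This dynamic update strategy for the gradient subspace substantially reduces computational cost and is one of Q-GaLore's main innovations. The quantization tolerance of the projection matrices was verified experimentally, which allows a further reduction in memory cost. With low-rank training methods, the memory needed to hold the model parameters accounts for most of the memory overhead. Stochastic rounding is used to accumulate small gradient contributions.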
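3.1 Quantization Basics
Quantization methods fall into two categories: PTQ and QAT.
QAT methods keep high-precision parameters and convert them into low-precision data formats during each forward and backward pass.
In Q-GaLore, the model weights are kept in INT8 while activations and gradients are computed in BFloat16. FP8 offers greater expressiveness than INT8 but is supported only on limited hardware such as NVIDIA Hopper series GPUs, so the more widely supported INT8 format is used. Block-wise uniform quantization is used to convert between data precisions.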
\[W_q = \text{Quant}_n(W, s, z) = \text{clamp}\left(\left\lfloor \frac{W}{s} \right\rceil + z, -2^{n-1}, 2^{n-1} - 1\right)\]
3.2 Layer-wise Convergence Behavior of the Gradient Subspace
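GaLore assumes that the training state of every layer is the same and recomputes the gradient subspace and projection matrices at a fixed interval, which means frequent and expensive SVD computations. The cosine similarity of the projection matrices obtained at regular intervals during LLaMA-130M pre-training was examined, with the following observations.
Based on these observations, the SVD frequency of individual layers can be updated dynamically. For a given layer \(l\), starting with an SVD interval \(t\), the cosine similarity of the projection matrices over the previous \(k\) intervals is monitored. If the cosine similarity stays at or above a threshold (e.g., \(\geq 40\%\)) across those \(k\) intervals, the interval is updated as \(t \rightarrow 2 \times t\) to reduce computation. This adaptive update closely mimics the performance of the original GaLore while cutting expensive SVD calls by more than 60%.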
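3.3 High Quantization Tolerance of the Projection Matrix
The adaptive convergence properties suggest that the projection matrix has a degree of redundancy, meaning that high accuracy is not essential. Motivated by this, the behavior of the projection matrix under quantization was investigated further. Block-wise quantization was applied to the projection matrices with a uniform block size of 256 across all layers. The experiments show that the projection matrices are robust to quantization, with little impact on pre-training quality even at 4 bits. Accordingly, the projection matrices are restricted to 4 bits, which further reduces the memory cost of the optimizer states in low-rank training by 25%.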
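3.4 Stochastic Rounding for Approximating the High-Precision Training Trajectory
With low-rank training methods, the memory needed to hold the model parameters accounts for most of the memory overhead, so the weights are kept in low precision to improve memory efficiency during training. The main challenge of training with low-precision parameters is the substantial loss of gradient information: at every optimization step, the high-precision gradient must be quantized into a low-precision weight update. Stochastic Rounding (SR) is used to address this.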
\[W_q = F_{SR}(W) = \begin{cases} \lfloor W \rfloor & \text{with probability } p = \lceil W \rceil - W \\ \lceil W \rceil & \text{with probability } p = W - \lfloor W \rfloor \end{cases}\]
Under this formulation, the expected value of \(W_q\) is:
\[E[W_q] = \lfloor W \rfloor ( \lceil W \rceil - W) + \lceil W \rceil (W - \lfloor W \rfloor) = W\]
Stochastic rounding is therefore unbiased: in expectation it preserves the exact value of \(W\), avoiding the systematic bias that deterministic rounding to an integer can introduce.
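4. Experiments
Q-GaLore greatly reduces memory overhead while maintaining comparable performance across a range of pre-training and fine-tuning tasks. For example, in the 1B-model experiment, using INT8 weights halved the memory cost of the original weights.
Q-GaLore can pre-train a 7B model within 16 GB of memory. Experiments at this scale are an important part of evaluating the scalability of new architectures or training methods.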
Since the 2020s, Large Language Models (LLMs) have demonstrated remarkable performance in various disciplines [2, 3, 4, 5, 6, 7]. However, the immense scale of LLMs, often comprising billions of parameters, presents a formidable challenge for most research groups in terms of training and full fine-tuning. For example, Meta’s LLaMA models were developed with 2048 A100-80GB GPUs for approximately a period of 5 months [8]. Even without factoring in any considerations for product efficiency, pre-training a LLaMA 7B model from scratch with a single batch size necessitates a minimum of 58 GB memory. This breakdown comprises 14 GB for trainable parameters, 42 GB for Adam optimizer states and weight gradients, and 2 GB for activation [1].
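The 58 GB figure can be reproduced with simple arithmetic. The sketch below is only an illustration under stated assumptions: 7 billion parameters, 2 bytes per stored value (BF16), two Adam moment tensors per parameter, and decimal gigabytes. Only the 2 GB activation figure is taken directly from the text; the function and constant names are hypothetical.

```python
# Rough memory estimate for full-rank BF16 training with Adam.
# All constants below are illustrative assumptions, not measured values.
PARAMS = 7e9        # assumed parameter count of LLaMA 7B
BYTES_BF16 = 2      # assumed bytes per stored value
GB = 1e9            # assumed decimal gigabyte


def full_rank_memory_gb(params=PARAMS, activation_gb=2.0):
    """Return a per-component memory estimate in GB."""
    weights = params * BYTES_BF16 / GB
    gradients = params * BYTES_BF16 / GB
    adam_states = 2 * params * BYTES_BF16 / GB   # first and second moments
    return {
        "weights": weights,
        "optimizer_and_gradients": adam_states + gradients,
        "activations": activation_gb,
        "total": weights + adam_states + gradients + activation_gb,
    }


breakdown = full_rank_memory_gb()
for name, value in breakdown.items():
    print(f"{name}: {value:.0f} GB")
```

Under these assumptions the components come to 14 GB, 42 GB, and 2 GB, matching the breakdown quoted above.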
Numerous research efforts have been dedicated to alleviating the substantial costs associated with training LLMs. These endeavors encompass a range of techniques, including small-scale LLM designing [9, 10], efficient scaling optima [11], training methodologies incorporating sparsity [12, 13, 14], sparse model training approaches [15, 16], and low-rank training strategies [17, 1]. Among these, GaLore [1] has emerged as a notable contender, enabling the full-parameter training of LLMs through low-rank gradient updates achieved via Singular Value Decomposition (SVD). Leveraging its low-rank characteristics, GaLore offers a significant reduction—up to 63.3%—in total training memory requirements, facilitating the training of a 7B model with a mere 24GB of memory.
Although GaLore offers substantial memory savings, its 24GB memory requirement still surpasses the available resources in many consumer devices. For instance, popular laptop GPUs like the RTX 4060 Ti are equipped with up to 16GB of memory. This limitation raises the question of how we can further reduce the memory footprint of low-rank LLM training to make it accessible to a wider range of hardware configurations. Also, GaLore requires regular updates to the gradient subspace through computationally expensive SVD operations (e.g., every 200 iterations) to approximate the training trajectory of full-rank training. The computational complexity of SVD operations is roughly on the order of $O(mn^2)$, where $m$ and $n$ are the dimensions of the matrix. As a result, it takes about 10 minutes for the LLaMA-7B model to update the subspace, leading to significant training latency.
To address these challenges, we delved into the training dynamics of the gradient subspace of GaLore and discovered two intriguing phenomena:
Figure 1: Comparison of data types and training flows of different methods. We by default use 8-bits Adam [18] as the inner optimizer. Note that the gradient in GaLore and Q-GaLore is not persistent during training, following the same strategy in [19, 20].
Inspired by these observations, we propose Q-GaLore, a novel approach that enables the training of large language models with low-precision weights and low-rank gradients. Q-GaLore introduces two modules to reduce memory overhead and training latency:
(i) Low precision training with low-rank gradients: We quantize the entire model (not only the optimizer state as in GaLore [1]) to 8 bits and the projection matrix to 4 bits, as shown in Figure 1. By using low-precision weights and projection matrices, our approach reduces memory requirements by approximately 28.57% for gradient low-rank training, where the weights represent the primary component of memory usage after low-rank projection. Additionally, to maintain training stability and approximate the trajectory of high-precision training, we implement Stochastic Rounding (SR) [21], which provides an unbiased estimate of the gradient trajectory and mitigates gradient information loss, thus enhancing training stability and overall performance.
(ii) Lazy layer-wise subspace exploration: We monitor the convergence levels of the gradient subspace in different layers and adaptively decrease the frequency of SVD operations for the layers whose low-rank subspace does not change significantly over time. This approach reduces the training time associated with SVD, saving over 32 hours for training a 7B model.
We demonstrate the efficacy of Q-GaLore in both pre-training and fine-tuning scenarios. For pretraining, Q-GaLore’s efficiency allows us to reduce the memory requirements of full-rank training and GaLore by 61% and 30%, respectively, across various model sizes from 60M to 7B. Notably,
Optimizing Large Language Models (LLMs) requires a substantial memory footprint to accommodate weights, activations, gradients, and optimization states. Low-Rank Adaptation (LoRA) [22] is a notable technique that introduces low-rank weight adapters for each layer, reducing the memory footprint by only optimizing the adapters, which can later be merged back into the original model. Subsequent enhancements to LoRA, such as quantization [23], multi-task learning support [24], and various architectural improvements [25, 26, 27, 28, 29, 30, 31, 32, 30], have all focused on fine-tuning scenarios. Despite the efficiency of low-rank adaptation, its suboptimal performance compared to full parameter optimization [33] has motivated the development of other memory-efficient optimization methods. For instance, [19, 20] reduce memory overhead through fused backward operations, eliminating the need to store all weight gradients. Sparse optimization techniques, such as BAdam [34] and LISA [35], partition parameters into blocks or sample layers based on importance to minimize memory costs while maintaining performance comparable to full parameter fine-tuning.
Early efforts to adapt LoRA for pre-training, such as ReLoRA [36], still require full-rank learning in the initial stages, resulting in high memory overhead. Recently, GaLore [1] leverages the low-rank properties of gradients [30] to enable full-parameter learning while significantly reducing memory usage during optimization. This approach allows GaLore to achieve better performance than common low-rank adaptation methods such as LoRA, while still being memory-efficient.
Low-precision training aims to improve training efficiency by storing data in low-precision formats and leveraging low-precision General Matrix Multiplication (GEMM) operations. This is distinct from post-training quantization, which primarily enhances the inference efficiency of pre-trained models. A significant challenge in low-precision training is potential instability during the training process. SWALP [37] addresses this issue using stochastic weight averaging [38], but it requires maintaining averaged weights, leading to high memory overhead in large foundational models. Other methods handle instability by scaling gradients [39] or second-order optimizer statistics [40].
While various low-precision training methods have been explored for smaller-scale convolutional networks [41, 42, 43, 44, 45, 46], they are generally not applicable to training large-scale transformers, as large tensors are less suitable for quantization [47]. Some approaches to low-precision training at a larger scale still require maintaining high-precision latent weights during training, significantly increasing memory consumption for large language models [48, 49]. This study aims to improve the end-to-end memory efficiency of training large-scale foundation models.
We first introduce the data type and quantization basics in Section 3.1. Section 3.2 demonstrates the adaptive convergence properties of the gradient subspace, which facilitates efficient training. In Section 3.3, we demonstrate the high tolerance of the projection matrix to quantization. Section 3.4 then discusses stochastic rounding for approximating high-precision training trajectories. The overall pipeline of Q-GaLore is depicted in Figure 4.
Generally, quantization methods are categorized into Post-Training Quantization (PTQ), where quantization is applied to pretrained models without further training; and Quantization-Aware Training (QAT), which incorporates quantization throughout the training process. QAT aims to either generate more quantizable models for faster inference or expedite the training process through low-precision operations. To preserve performance, these methods retain high-precision parameters throughout the training process and apply quantization to transfer the parameters into low-precision data formats during each forward and backward pass. Maintaining high-precision parameters occupies massive memory and results in even larger memory requirements than vanilla high-precision training. In this work, we focus on improving the memory efficiency of training large language models and do not maintain the high-precision parameters.
In Q-GaLore, the model weights are retained in INT8 while activations and gradients are computed in BFloat16. Although FP8 [50] offers greater expressiveness than INT8, it is supported only on limited hardware devices, e.g., the NVIDIA Hopper series GPUs, which are costly and not widely available. Thus, we employ the more general INT8 format. The pseudocode is presented in Appendix A. To convert data precisions, we utilize block-wise uniform quantization [51]:
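\[W_q = \text{Quant}_n(W, s, z) = \text{clamp}\left(\left\lfloor \frac{W}{s} \right\rceil + z, -2^{n-1}, 2^{n-1} - 1\right)\]
In this expression, $W$ and $W_q$ are the original and quantized tensors, $s$ is the scaling factor, $z$ is the zero point, and $n$ is the number of quantization bits.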
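As a minimal sketch of the quantizer above, the following pure-Python code applies it block by block to a flat list of floats. It treats $s$ as a per-block scale and $z$ as a per-block zero point, and derives both from the block's minimum and maximum; that choice, the function names, and the handling of constant blocks are assumptions for illustration, not the paper's implementation. Python's built-in `round` resolves exact ties to the nearest even integer, which may differ from the rounding used in practice.

```python
import math


def quantize_block(block, n_bits=8):
    """Quantize one block: clamp(round(w / s) + z, -2^(n-1), 2^(n-1) - 1)."""
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    lo, hi = min(block), max(block)
    if hi > lo:
        scale = (hi - lo) / (qmax - qmin)      # s: spread the range over all codes
    else:
        scale = abs(lo) or 1.0                 # constant block: any nonzero scale works
    zero_point = qmin - round(lo / scale)      # z: chosen so the minimum maps to qmin
    codes = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in block]
    return codes, scale, zero_point


def dequantize_block(codes, scale, zero_point):
    """Approximate inverse of quantize_block."""
    return [(q - zero_point) * scale for q in codes]


def quantize_blockwise(values, block_size=256, n_bits=8):
    """Split a flat list into blocks and quantize each one independently."""
    return [quantize_block(values[i:i + block_size], n_bits)
            for i in range(0, len(values), block_size)]


values = [math.sin(i) for i in range(600)]
blocks = quantize_blockwise(values, block_size=256, n_bits=8)
restored = [w for codes, s, z in blocks for w in dequantize_block(codes, s, z)]
print(len(blocks), max(abs(a - b) for a, b in zip(values, restored)))
```

In this sketch the zero point is kept as an ordinary Python integer and is not itself restricted to $n$ bits; a real kernel would need to decide how to store it.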
Figure 2: Cosine similarity between the adjacent projection matrices captured every 250 training iterations.
GaLore relies on a fixed interval to recompute the gradient space and projection matrices blindly, assuming that the training dynamics of all the layers in LLMs remain the same. One direct implication is the frequent computation of computationally expensive SVD. To this end, we ask: How do the gradient subspace dynamics vary during the pre-training of LLMs? We investigated the cosine similarity across the projection matrices obtained at regular intervals during the pre-training of LLaMA-130M, as shown in Figure 2. Our observations are as follows:
This observation provides a unique opportunity to monitor the gradient subspace behavior during pre-training and dynamically update the frequency of SVD for each layer if we observe saturation. More specifically, starting with an SVD interval of $t$ for a layer $l$, we monitor the cosine similarity of projection matrices in the previous $k$ intervals. If the cosine similarity across the $k$ intervals remains greater than a threshold (e.g., $\geq 40\%$), we update the interval from $(t \rightarrow 2 \times t)$ to reduce the compute. This adaptive lazy update can closely mimic the performance of the original GaLore with over $60\%$ reduction in computationally expensive SVD calls. Further ablation studies about the trade-off between SVD calls and performance are presented in Section 4.4.
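The interval-doubling rule can be sketched as a small per-layer scheduler. This is a hypothetical illustration: the class name, the flattening of each projection matrix into a plain list, the choice to reset the similarity history after each doubling, and the lack of any handling for sign flips in singular vectors are all assumptions that the text does not specify.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flattened projection matrices."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LazySubspaceSchedule:
    """Tracks one layer and doubles its SVD interval once the subspace is stable."""

    def __init__(self, interval, k=3, threshold=0.4):
        self.interval = interval      # current SVD interval t, in training steps
        self.k = k                    # number of recent intervals to inspect
        self.threshold = threshold    # similarity threshold (0.4 follows the 40% example)
        self.recent = []              # similarities between adjacent projections
        self.previous = None          # last projection matrix seen, flattened
        self.next_update = 0          # step at which the next SVD is due

    def should_update(self, step):
        return step >= self.next_update

    def record(self, step, projection):
        """Call right after recomputing the projection matrix at `step`."""
        if self.previous is not None:
            self.recent.append(cosine_similarity(self.previous, projection))
            self.recent = self.recent[-self.k:]
            if len(self.recent) == self.k and min(self.recent) >= self.threshold:
                self.interval *= 2    # t -> 2 * t
                self.recent.clear()   # assumption: start a fresh window after doubling
        self.previous = list(projection)
        self.next_update = step + self.interval


schedule = LazySubspaceSchedule(interval=200, k=2, threshold=0.4)
for step in (0, 200, 400):
    schedule.record(step, [1.0, 0.0])   # an unchanged subspace
print(schedule.interval, schedule.next_update)
```

In a training loop, each layer would hold its own schedule, call `should_update(step)` before deciding whether to run SVD, and call `record` with the new projection afterwards.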
Figure 3: Pre-training performance on the LLaMA-130M models. The projection matrices are quantized with different bits.
The adaptive convergence properties suggest that the projection matrix has a degree of redundancy, indicating that high accuracy is not essential. This observation inspired us to further investigate the functionality of the projection matrix under quantization conditions. We implemented block-wise quantization for the projection matrices, maintaining a uniform block size of 256 across all layers. During these experiments, we ensured that the update steps for the projection matrices remained constant, allowing us to focus exclusively on their quantization characteristics. Figure 3 illustrates the results for the LLaMA-130M models, demonstrating that the projection matrices are highly resilient to quantization, with minimal impact on pre-training quality even when reduced to 4 bits. Based on these findings, we applied quantization to the projection matrices, restricting them to 4 bits. This approach further reduces the memory cost of the optimizer states in low-rank training by 25%.
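The text does not describe how 4-bit values are laid out in memory. One common layout, shown here purely as an assumption, packs two signed 4-bit codes into each byte, so a quantized projection matrix occupies half the bytes of an INT8 copy. The function names are hypothetical.

```python
def pack_int4(codes):
    """Pack signed 4-bit codes (each in -8..7) two per byte, low nibble first."""
    padded = list(codes)
    if len(padded) % 2:
        padded.append(0)                       # pad to an even count
    packed = bytearray()
    for low, high in zip(padded[0::2], padded[1::2]):
        packed.append((low & 0xF) | ((high & 0xF) << 4))
    return bytes(packed)


def unpack_int4(data, count):
    """Recover `count` signed 4-bit codes from bytes produced by pack_int4."""
    codes = []
    for byte in data:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


codes = [-8, -1, 0, 7, 3]
packed = pack_int4(codes)
print(len(packed), unpack_int4(packed, len(codes)))
```

The per-block scale and zero point still need to be stored alongside the packed codes, so the real saving is slightly less than a factor of two relative to INT8.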
When using low-rank training methods such as GaLore, the allocation of memory to maintain model parameters constitutes the majority of the memory overhead. Consequently, we opt to maintain the weights in low precision to enhance memory efficiency during training. The primary challenge of training with low-precision parameters is the significant reduction of gradient information. During each optimization step, the full precision gradient must be quantized to a low precision weight update. However, if the gradient magnitude is not large enough, it will be mitigated via the round-to-nearest scheme. Conventional Quantization-Aware Training (QAT) retains full precision parameters to accumulate small gradient contributions, albeit at the cost of increased memory overhead. To address this issue, we employ Stochastic Rounding (SR) [21, 52, 53], that is formulated as the following:
\[W_q = F_{SR}(W) = \begin{cases} \lfloor W \rfloor & \text{with probability } p = \lceil W \rceil - W \\ \lceil W \rceil & \text{with probability } p = W - \lfloor W \rfloor \end{cases}\]
Under this formulation, the expected value of $W_q$ is
\[E[W_q] = \lfloor W \rfloor (\lceil W \rceil - W) + \lceil W \rceil (W - \lfloor W \rfloor) = W.\]
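A small simulation can make the unbiasedness concrete. The sketch below implements the rounding rule for scalars and compares it with round-to-nearest when many small updates are accumulated into an integer-valued weight. The function names, the update size of 0.1, and the seeds are illustrative assumptions.

```python
import math
import random


def stochastic_round(w, rng):
    """Round down with probability ceil(w) - w, otherwise round up."""
    floor = math.floor(w)
    frac = w - floor                      # probability of rounding up
    return floor + (1 if rng.random() < frac else 0)


def accumulate(update, steps, rounder):
    """Repeatedly add a small update to a weight that must stay an integer."""
    w = 0
    for _ in range(steps):
        w = rounder(w + update)
    return w


rng = random.Random(0)
samples = [stochastic_round(2.3, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)        # close to 2.3 on average

nearest_total = accumulate(0.1, 1000, round)
sr_total = accumulate(0.1, 1000, lambda x: stochastic_round(x, random.Random(x).__class__(1) if False else rng))
print(mean, nearest_total, sr_total)
```

With round-to-nearest, every update of 0.1 is rounded away and the weight never moves. With stochastic rounding, roughly one update in ten survives, so the accumulated value tracks the high-precision sum in expectation.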
Figure 4: Illustration of the training flows for Q-GaLore, where the dotted icon denotes intermediate tensors that do not consistently occupy memory.
The pipeline of Q-GaLore is illustrated in Figure 4. The left section of the figure depicts the computation flows, where only the gradients are maintained in high precision to preserve essential training dynamics information. We employ an 8-bit version of the Adam optimizer [18] as the internal optimizer. During each training iteration, the full-rank gradient is projected into a low-rank format and then incorporated into the optimizer states. To project the gradient into the subspace, we obtain the projection matrix using Singular Value Decomposition (SVD), as described in [1]. The update frequency of the projection matrix is managed through our adaptive update strategy, and the matrix is quantized to 4-bits formats to reduce memory overhead.
Furthermore, after updating the optimizer states, we project the low-rank optimizer states back to full rank and update the parameters. As the weights are consistently maintained at low precision, an additional quantization step is necessary to update the weights. Here, we utilize stochastic rounding to capture the minor-gradient nuances and provide an unbiased estimation of the high-precision weights. Additionally, we employ a fused backward operation as described in [20, 1, 19]. Upon calculating the gradients for a single layer, we promptly update the corresponding optimizer state and weights, subsequently releasing the memory allocated to the gradients.
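The flow described above (project the gradient, update in the low-rank space, project back, then re-quantize the weights with stochastic rounding) can be sketched on tiny matrices. This is a heavily simplified illustration, not the Q-GaLore implementation: a plain scaled-gradient step stands in for 8-bit Adam, the projection matrix is given rather than obtained by SVD, weights use a single symmetric scale with no zero point, and all names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def stochastic_round(x, rng):
    floor = math.floor(x)
    return floor + (1 if rng.random() < x - floor else 0)


def low_rank_step(w_int8, scale, grad, proj, lr, rng):
    """One simplified update of INT8 weights whose real value is code * scale."""
    low_rank_grad = matmul(transpose(proj), grad)            # (r x n) projected gradient
    low_rank_update = [[-lr * g for g in row] for row in low_rank_grad]  # stand-in for Adam
    full_update = matmul(proj, low_rank_update)              # back to (m x n)
    new_w = []
    for w_row, u_row in zip(w_int8, full_update):
        new_w.append([
            max(-128, min(127, stochastic_round(w + u / scale, rng)))
            for w, u in zip(w_row, u_row)
        ])
    return new_w


proj = [[1.0], [0.0]]                      # m = 2, rank r = 1
grad = [[1.0, -1.0], [5.0, 5.0]]           # only the first row lies in the subspace
weights = [[0, 0], [10, 10]]
updated = low_rank_step(weights, scale=0.1, grad=grad, proj=proj, lr=0.1,
                        rng=random.Random(0))
print(updated)
```

In this toy case the second row of the gradient lies outside the one-dimensional subspace, so only the first row of the weights changes.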
In this section, we evaluate the effectiveness of Q-GaLore on both pre-training and fine-tuning tasks. In Section 4.1, we detail the implementation of models, tasks, hyperparameters, and baseline approaches. We then demonstrate that Q-GaLore achieves comparable performance on both pretraining and fine-tuning tasks (Section 4.2). Additionally, Sections 4.3 and 4.4 provide end-to-end memory analysis and extensive ablation studies, respectively.
Network Architecture. For the pretraining task, we adopt the LLaMA-based architecture with sizes ranging from 60 million to 7 billion, following the setups from [1, 36]. During downstream experiments, we select various pre-trained models to evaluate the general effectiveness of Q-GaLore, including RoBERTa [54] base, LLaMA-3-8B [55], Gemma-7B [56], and Mistral-7B [57].
Pre-Training. We pre-train the LLaMA models on the C4 dataset [58]. The C4 dataset is a massive collection of Common Crawl's web crawl corpus, meticulously filtered and cleaned to ensure high-quality language modeling and training. It is widely used for pre-training large language models due to its diverse and extensive textual content. We train the models on this sufficiently large dataset without data repetition and scale the model size up to 7 billion parameters, a crucial experiment for demonstrating the effectiveness of the proposed methods for practical pre-training.
Fine-Tuning. The downstream tasks cover two categories:
Baselines. We consider five baseline methods for comparison:
(ii) Low-Rank: The original weights are factorized into low-rank components:
\[W = UV,\]
where $U$ and $V$ are optimized via Adam [62].
(iii) LoRA: LoRA [22] introduces low-rank adaptors for training the models,
\[W = W_0 + UV,\]
where $W_0$ denotes the pretrained weights, which are frozen during training. We use the initialized weight as $W_0$ during pretraining and only optimize $U$ and $V$. By default, we use 32 for LoRA alpha and 0.05 for LoRA dropout.
(iv) ReLoRA: ReLoRA [36] enhances the original LoRA methods for better pre-training. ReLoRA is a stage-wise LoRA that periodically merges $UV$ into the original $W$ and initializes a new $UV$ for continued training.
(v) QLoRA [23]: We use the same hyperparameters as for LoRA: 32 for QLoRA alpha and 0.05 for QLoRA dropout. We keep the base models in 8 bits for a fair comparison.
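The LoRA parameterization $W = W_0 + UV$ and the ReLoRA merge step can be illustrated on small matrices. This sketch follows the formulas as written above and omits the alpha scaling and dropout. Zero-initializing the new $U$ after a merge, so that the effective weight is unchanged at the restart, is an assumption made for illustration, and all function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_weight(w0, u, v):
    """Effective weight W = W0 + U V, with W0 frozen and U, V trainable."""
    return add(w0, matmul(u, v))


def relora_merge(w0, u, v, rng):
    """Fold U V into the base weight and start a fresh pair of adapters."""
    merged = lora_weight(w0, u, v)
    new_u = [[0.0 for _ in row] for row in u]                   # assumed zero init
    new_v = [[rng.gauss(0.0, 0.02) for _ in row] for row in v]  # assumed random init
    return merged, new_u, new_v


w0 = [[1.0, 2.0], [3.0, 4.0]]
u = [[1.0], [2.0]]            # (2 x 1), rank 1
v = [[0.5, -1.0]]             # (1 x 2)
effective = lora_weight(w0, u, v)
merged, new_u, new_v = relora_merge(w0, u, v, random.Random(0))
print(effective, lora_weight(merged, new_u, new_v))
```

Because the new $U$ is zero, the model computes the same function immediately after the merge, and subsequent training moves it along a new low-rank direction.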
We pre-trained the LLaMA-based models from scratch on the C4 dataset using various memory-efficient methods. The experiments encompassed different model sizes ranging from 60 million to 1 billion parameters, with results reported in Table 1. In each experiment, we report the perplexity values obtained on the validation set. As the primary memory savings are derived from compressing the weight and optimizer states, we provide estimates of the memory overhead associated with storing these components. Detailed discussions on end-to-end memory measurements and throughput comparisons are provided in Section 4.3. For fair comparison, we used the same low-rank dimensions for all the memory-efficient approaches, specifically {128, 256, 256, and 512} for {60M, 130M, 350M, and 1B} models, respectively. We use 16-bit Adam as the inner optimizer inside GaLore, while Q-GaLore uses the 8-bit Adam optimizer.
Table 1: Comparison results of various memory-efficient algorithms on pre-training tasks. Experiments are conducted on the C4 dataset with LLaMA models. For each experiment, we report both the perplexity and estimated memory. The estimated memory only accounts for the weights and optimizer states, which make up the majority of the memory overhead. We follow the same settings and collect the results of all baseline methods from [1], where the training tokens are {1.1B, 2.2B, 6.4B, 13.1B} for {60M, 130M, 350M, 1B} models, respectively.
Incorporating adaptive subspace updating, projection and weight quantization, and stochastic rounding, our Q-GaLore method maintains comparable pre-training performance (with less than a 0.84 perplexity increase, compared with the original GaLore approach) while significantly reducing memory overhead. For example, in the experiment of 1 billion model size, training with INT8 weights halved the original memory cost for weights and achieved a 29.68% memory saving against the original GaLore method and a 60.51% memory saving compared to the Full baseline. Compared to GaLore, the additional memory savings primarily come from two sources:
Since the scaling ability of LLMs is a key demand, experiments at the size of 7 billion models serve as an essential part for evaluating the scalability of new architectures or training methods. Thus we pre-trained a 7B LLaMA model from scratch on the C4 dataset. Given the tremendous computational cost of pre-training 7B models, we currently only trained the models for 40K steps, resulting in a higher perplexity than 1B. The models are still under training towards 150k steps and we will update the number once available.
Table 2: Results of pre-training LLaMA-7B model on C4 dataset. Baseline results are obtained from [1].
We compared the original 8-bit Adam, 8-bit GaLore (GaLore with 8-bit Adam), and our Q-GaLore method. Note that both 8-bit Adam and 8-bit GaLore only store the optimization states in 8 bits, leaving the weights and projection matrices in 16 bits, while our Q-GaLore maintains the weights in INT8 and the projection matrices in INT4. From Table 2, we can observe that our method achieved matching performance, with a perplexity difference of less than 1. To enhance training stability with low-precision weights, we opted for a reduced maximum learning rate, setting it at 0.004 compared to the baseline's 0.005. This marginally lower learning rate potentially slows the convergence speed during the initial stages, although the difference in perplexity relative to the baseline is negligible. Notably, our approach not only achieves comparable performance, but also requires only around 15GB of memory overhead. This efficiency enabled the pre-training experiments to be conducted on a single NVIDIA RTX 4060 Ti, which has a 16GB memory budget.
Pre-training LLMs is a resource-intensive task that is typically only feasible for large companies or computing centers. In most practical scenarios, memory-efficient fine-tuning of LLMs on specific downstream tasks is more common. To evaluate the effectiveness of Q-GaLore, we selected a diverse set of downstream tasks, including eight tasks from the GLUE benchmark and four subtasks from MMLU, which assess the ability of LLMs to understand natural language. We compared the performance of Q-GaLore with the baseline Full method and three state-of-the-art low-rank optimization approaches: LoRA, GaLore and QLoRA. It is important to note that while GaLore utilizes a 16-bit Adam optimizer, Q-GaLore employs an 8-bit Adam optimizer, further reducing memory requirements without compromising performance.
Table 3: Comparison results of various memory-efficient fine-tuning algorithms on MMLU tasks. Note that the reported memory stands for the estimated memory overhead for weights and optimizer states. End-to-end memory measurements are discussed at Section 4.3.
Tables 3 and 4 lead to consistent observations: (i) Q-GaLore achieves performance comparable to the full fine-tuning baseline across different models (LLaMA-3-8B, Gemma-7B, Mistral-7B, and RoBERTa-base), with a minimal performance gap of less than 0.65 compared to Full; (ii) Q-GaLore demonstrates comparable or even superior performance compared to LoRA, with a gain of 1.02 on the MMLU benchmark of Gemma-7B while also requiring less memory; (iii) Compared with QLoRA, Q-GaLore demonstrates consistent performance gains (up to 5.19) across architectures and tasks, at the same memory cost.
Table 4: Comparison results of various memory-efficient fine-tuning algorithms on GLUE tasks, with the pretrained RoBERTa model (baseline results are obtained from [1]). We report the Matthews correlation for the CoLA task, Pearson correlation for STS-B, average (matched and mismatched) accuracy for MNLI, F1 score for MRPC, and accuracy for all other tasks. Note that the reported memory stands for the estimated memory overhead for weights and optimizer states. End-to-end memory measurements are discussed in Section 4.3.
We present an end-to-end memory measurement for training a LLaMA-7B model in Figure 5. Starting from the baseline full parameter training with the BF16 Adam optimizer, the 8-bit Adam optimizer halves the memory overhead of the optimizer states by quantizing them to a lower precision format. Then, 8-bit GaLore further compresses the memory cost by converting the optimizer states into a low-rank format. Moreover, 8-bit GaLore employs a fused backward operation that sequentially releases the gradient memory, rendering the gradient memory cost negligible. Building on this, Q-GaLore incorporates INT8 weights, which halve the memory requirement for weights. Projection quantization then further reduces the memory allocated to optimizer states. Notably, only Q-GaLore can train a LLaMA-7B model within the 16 GB memory constraint, demonstrating the potential for optimizing models on edge devices. Additionally, due to the varying data formats of gradients and weights, the requisite quantization and dequantization operations incur a throughput overhead of 14.64% compared to the original GaLore. We will improve the implementation in future work.
Figure 5: Results of the memory allocation of training a LLaMA-7B model with a single batch size of 256.
In this section, we focus on the ablation studies of Q-GaLore, centering on two key questions: Q1: How does Stochastic Rounding (SR) benefit the training process? Q2: What is the trade-off between training performance and SVD counts in Q-GaLore?
A1: Enhanced low-precision training with stochastic rounding. Stochastic rounding provides an unbiased estimation of accumulated gradient information, which is crucial for low-precision training. We conducted controlled experiments to pre-train LLMs with and without stochastic rounding. To ensure a fair comparison, we maintained consistency in other hyperparameters across the experiments: weights were stored in the INT8 data format, projection matrices were subjected to 4-bit quantization, and the adaptive convergence ratio for the gradient subspace was set at 0.4.
Figure 6: Ablation study of pre-training with Q-GaLore w/ or w/o Stochastic Rounding (SR). Full curve stands for the perplexity of the final checkpoint that optimized by original Adam optimizer. Each subfigure includes a smaller inset that represents the zoomed-in results.
Figure 6 illustrates the perplexity on the validation set throughout the training process. At each training step, gradient information is quantized back to the low-precision format (INT8), resulting in considerable information loss and suboptimal performance. The perplexity increased by 7.86, 1.98, and 2.27 for models with sizes of 60, 160, and 350 million parameters, respectively. Additionally, we implemented an initial warm-up stage for pre-training to improve training stability, during which the weight updates are generally smaller. During this stage, significant loss of gradient information occurs due to the vanilla round-to-nearest scheme, resulting in a perplexity gap ranging from 18.67 to 47.02 compared with models using stochastic rounding. Meanwhile, Q-GaLore can effectively capture the gradient information without additional memory costs, achieving performance comparable to the Full baseline, with a perplexity gap of less than 1.
Figure 7: Trade-off between performance and SVD counts for updating gradient subspace. Results are normalized by SVD counts of original GaLore.
A2: Over 60% of SVD operation costs can be saved for free. We explore the trade-off between the number of SVD operations used for updating the gradient subspace and pre-training performance on the LLaMA-130M model. Figure 7 (Right) demonstrates that the SVD count can be reduced efficiently: with only 36.20% of the SVD operations, Q-GaLore achieves performance comparable to the GaLore baseline, resulting in significant time savings. Specifically, to update the gradient subspace of a LLaMA-7B model, the SVD operation requires approximately 10 minutes when measured on a single NVIDIA RTX A6000 GPU, and this gradient subspace is updated 300 times across 150,000 training iterations. By achieving more than 60% savings in SVD operations, our method reduces the time cost by over 32 hours.
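The time saving can be checked with back-of-the-envelope arithmetic. The sketch below takes the figures quoted above (300 subspace updates, about 10 minutes each) and assumes, as an illustration only, that the 36.20% fraction measured on LLaMA-130M carries over unchanged to the 7B setting. The function name is hypothetical.

```python
def svd_hours(total_updates=300, minutes_per_update=10.0, fraction_kept=1.0):
    """Total wall-clock hours spent on SVD for a given fraction of updates."""
    return total_updates * fraction_kept * minutes_per_update / 60.0


baseline_hours = svd_hours()                         # every scheduled SVD is run
lazy_hours = svd_hours(fraction_kept=0.3620)         # assumed fraction of SVDs kept
saved_hours = baseline_hours - lazy_hours
steps_per_update = 150_000 // 300                    # fixed interval implied by the text

print(f"baseline: {baseline_hours:.1f} h, lazy: {lazy_hours:.1f} h, "
      f"saved: {saved_hours:.1f} h, interval: {steps_per_update} steps")
```

Under these assumptions the baseline spends about 50 hours on SVD and the saving comes to roughly 32 hours, in line with the figure quoted above.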
To overcome these challenges and further enhance memory-efficient training, we propose Q-GaLore, a method that reduces memory usage through quantization and low-rank projection. Our approach is motivated by two key observations during gradient low-rank training: (1) the gradient subspace exhibits diverse properties, with some layers converging at the very early training stages while others are subject to frequent changes; (2) the projection matrices demonstrate high quantization-friendliness and function effectively under 4-bit quantization. Building on these, Q-GaLore enables low-precision training (INT8 for the entire model and INT4 for the projection matrix) with low-rank gradients and significantly fewer SVD operations. Our experiment results demonstrate that Q-GaLore achieves competitive pre-training and fine-tuning performance, e.g., for the first time facilitating training LLaMA-7B on a single NVIDIA RTX 4060 Ti with only 16GB memory.