LLM FineTune | Training with MXFP4

Created: 2025-03-06 07:32:42 +0000

Last modified: 2025-03-06 20:56:50 +0900

Training LLMs with MXFP4

url: https://arxiv.org/abs/2502.20586

pdf: https://arxiv.org/pdf/2502.20586

html: https://arxiv.org/html/2502.20586v1

github: https://github.com/NovaSky-AI/SkyThought

abstract: Low precision (LP) datatypes such as MXFP4 can accelerate matrix multiplications (GEMMs) and reduce training costs. However, directly using MXFP4 instead of BF16 during training significantly degrades model quality. In this work, we present a near-lossless training recipe that uses MXFP4 GEMMs, which are 2x faster than FP8 on supported hardware. Our key insight is to compute unbiased gradient estimates with stochastic rounding (SR), resulting in more accurate model updates. However, directly applying SR to MXFP4 can result in high variance from block-level outliers, harming convergence. To overcome this, we use the random Hadamard transform to theoretically bound the variance of SR. We train GPT models up to 6.7B parameters and find that our method induces minimal degradation over mixed-precision BF16 training. Our recipe computes >1/2 the training FLOPs in MXFP4, enabling an estimated speedup of >1.3x over FP8 and >1.7x over BF16 during backpropagation.

keywords: MXFP4, MXFP6, Microscaling (MX)

TL;DR

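Amid research that has mostly focused on MXFP6, a paper on MXFP4 has now appeared.

Low-precision data types such as MXFP4 can accelerate GEMM operations and reduce training cost, but using them directly in place of BF16 significantly degrades model quality. This work presents a near-lossless training recipe built on MXFP4 GEMMs: stochastic rounding (SR) gives unbiased gradient estimates, and a random Hadamard transform controls the variance of SR. When training GPT models (up to 6.7B parameters), the authors report minimal degradation relative to BF16 and a speedup of more than 1.7x in the backpropagation stage. However, the study is limited to GPT models of up to 6.7B parameters, so follow-up empirical work on larger models seems necessary before trying this in practice, and compatibility with training frameworks also needs to be considered.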

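Low-precision data type (MXFP4)

Acceleration: faster GEMM operations raise overall training throughput
Cost reduction: more efficient use of compute lowers training cost
Throughput comparison: MXFP4 GEMMs are 2x faster than FP8 on hardware that supports them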
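To make the data type concrete, here is a minimal NumPy sketch of MXFP4 "fake quantization" with round-to-nearest. It assumes the layout described in the OCP Microscaling (MX) format: blocks of 32 elements share one power-of-two scale, and each element is an FP4 (E2M1) value whose magnitude is one of 0, 0.5, 1, 1.5, 2, 3, 4, 6. The function name `mxfp4_quantize_nearest` and the tie-breaking behavior are illustrative choices, not taken from the paper.

```python
import numpy as np

# Magnitudes representable in FP4 (E2M1), assuming the OCP MX element format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # assumed MX block size: 32 elements share one scale


def mxfp4_quantize_nearest(x):
    """Fake-quantize a 1-D array to MXFP4 (hypothetical helper).

    Returns the dequantized values and the per-block power-of-two scales.
    """
    x = np.asarray(x, dtype=np.float64)
    assert x.size % BLOCK == 0, "length must be a multiple of the block size"
    blocks = x.reshape(-1, BLOCK)

    # Shared scale: 2^(floor(log2(amax)) - 2), where 2 is the exponent of the
    # largest FP4 magnitude (6 = 1.5 * 2^2). All-zero blocks get scale 2^-2.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    safe_amax = np.where(amax > 0, amax, 1.0)
    scale = 2.0 ** (np.floor(np.log2(safe_amax)) - 2)

    # Scaled magnitudes lie in [0, 8); anything above 6 saturates to 6.
    scaled = np.clip(blocks / scale, -FP4_GRID[-1], FP4_GRID[-1])

    # Round each magnitude to the nearest grid point (ties go to the smaller one).
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    quantized = np.sign(scaled) * FP4_GRID[idx]

    return (quantized * scale).reshape(x.shape), scale.ravel()


x = np.zeros(BLOCK)
x[0], x[1] = 6.0, 0.3
dequantized, scales = mxfp4_quantize_nearest(x)
```

In this example the block maximum is 6, so the shared scale is 1 and the small value 0.3 can only land on 0 or 0.5. This coarse grid is the reason a direct swap from BF16 to MXFP4 loses so much information.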

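Comparison with BF16

Problem: using MXFP4 directly in place of BF16 significantly degrades model quality
Goal: devise a method that keeps the benefits of low-precision arithmetic without that degradation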

Near-lossless training recipe

Goal: exploit the speed of MXFP4 GEMMs while minimizing model-quality degradation
Approach: train efficiently with only minimal degradation relative to BF16

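Stochastic rounding (SR)

Key idea: compute unbiased gradient estimates, which yields more accurate model updates
Benefit: SR rounding error is zero in expectation, so quantization does not introduce a systematic bias into training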
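The unbiasedness property can be checked with a short sketch. The code below stochastically rounds values onto the FP4 (E2M1) magnitude grid: a value between two neighboring grid points is rounded up with probability proportional to its distance from the lower neighbor. The function name `stochastic_round_fp4` is hypothetical, and the inputs are assumed to be already scaled into [-6, 6]; values outside that range are clipped here, which would break unbiasedness.

```python
import numpy as np

# Magnitudes representable in FP4 (E2M1), assuming the OCP MX element format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def stochastic_round_fp4(x, rng):
    """Stochastically round values onto the FP4 grid (hypothetical helper)."""
    x = np.asarray(x, dtype=np.float64)
    mag = np.clip(np.abs(x), 0.0, FP4_GRID[-1])

    # Find the neighboring grid points lo <= mag <= hi.
    hi_idx = np.clip(np.searchsorted(FP4_GRID, mag, side="right"),
                     1, len(FP4_GRID) - 1)
    lo = FP4_GRID[hi_idx - 1]
    hi = FP4_GRID[hi_idx]

    # Round up with probability equal to the fractional position in [lo, hi].
    p_up = (mag - lo) / (hi - lo)
    rounded = np.where(rng.random(mag.shape) < p_up, hi, lo)
    return np.sign(x) * rounded


rng = np.random.default_rng(0)
samples = stochastic_round_fp4(np.full(200_000, 2.4), rng)
sample_mean = samples.mean()  # close to 2.4, although each sample is 2.0 or 3.0
```

Round-to-nearest would map every 2.4 to 2.0, a consistent error of -0.4. With SR each individual sample is still 2.0 or 3.0, but the errors average out across many samples, which is the property that matters for gradient accumulation.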

Problem with applying SR: increased variance

Observation: applying SR directly to MXFP4 produces high variance caused by block-level outliers
Impact: high variance harms training convergence
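A short derivation shows where the variance comes from. It is a sketch under assumed notation that is not taken from the paper: $a \le x \le b$ are adjacent grid points in the scaled domain, and $s$ is the shared block scale.

```latex
\begin{align*}
\mathrm{SR}(x) &=
  \begin{cases}
    b & \text{with probability } p = \dfrac{x - a}{b - a},\\[4pt]
    a & \text{with probability } 1 - p,
  \end{cases}\\
\mathbb{E}[\mathrm{SR}(x)] &= a + p\,(b - a) = x,\\
\operatorname{Var}[\mathrm{SR}(x)] &= p\,(1 - p)\,(b - a)^2 = (x - a)(b - x) \le \frac{(b - a)^2}{4},\\
\operatorname{Var}[\,s \cdot \mathrm{SR}(x / s)\,] &\le \frac{s^2 (b - a)^2}{4}.
\end{align*}
```

Because the shared scale $s$ grows with the largest magnitude in the block, a single outlier widens the effective grid spacing for every element in that block, and the variance bound for each of them grows with $s^2$.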

Random Hadamard transform (RHT)

Solution: the transform theoretically bounds the variance of SR, mitigating the high-variance problem
Effect: more stable gradient estimates and training convergence
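The following sketch illustrates two properties that make a random Hadamard transform useful here: it is orthogonal, so applying it to both GEMM operands along the reduction dimension leaves the product unchanged, and it spreads a single outlier across the whole block, shrinking the block maximum that sets the shared scale. The helper names, the Sylvester construction, and the choice of a single 32-wide block are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def hadamard(n):
    """Orthonormal n x n Hadamard matrix via Sylvester construction (n a power of 2)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def random_hadamard(n, rng):
    """Return diag(signs) @ H with random +-1 signs; the result is orthogonal."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs[:, None] * hadamard(n)


rng = np.random.default_rng(0)
n = 32
Q = random_hadamard(n, rng)

A = rng.standard_normal((4, n))
B = rng.standard_normal((n, 4))
A[0, 3] = 100.0  # inject a block-level outlier

# Transform both operands along the shared (reduction) dimension.
A_t = A @ Q
B_t = Q.T @ B

product_error = np.abs(A_t @ B_t - A @ B).max()  # near zero, since Q @ Q.T = I
max_before = np.abs(A[0]).max()                  # dominated by the outlier
max_after = np.abs(A_t[0]).max()                 # outlier energy spread over 32 entries
```

After the transform the outlier contributes about 100 / sqrt(32) to every entry of the row instead of 100 to one entry, so the block maximum, and with it the SR grid spacing, is much smaller. In a real pipeline the quantization would be applied to `A_t` and `B_t` before the low-precision GEMM.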

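Experimental results and performance

Model scale: GPT models trained with up to 6.7B parameters
Model quality: minimal degradation compared with BF16 mixed-precision training
Compute efficiency: more than half of total training FLOPs are computed in MXFP4
Speedup: an estimated speedup of over 1.3x versus FP8 and over 1.7x versus BF16 during backpropagation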
