
Scaling Law with LR Annealing

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-08-20

Scaling Law with Learning Rate Annealing

  • url: https://arxiv.org/abs/2408.11029
  • pdf: https://arxiv.org/pdf/2408.11029
  • html: https://arxiv.org/html/2408.11029v1
  • abstract: We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps $s$: \(L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2\), where $S_1$ is the forward area and $S_2$ is the LR annealing area. This formulation takes into account two factors: (1) the forward scaling defined by the typical scaling law, and (2) the additional loss drop brought by LR annealing. Therefore, this formulation can describe the full loss curve at each step, rather than the single loss point at the end of training. Applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss of language model training at any given step and across any learning rate scheduler (LRS). Furthermore, this equation accurately describes the dynamics during the training process, and provides a theoretical verification and explanation for numerous experimental findings of previous studies, particularly those focusing on LR schedules and LR annealing. The resulting insights also serve as a guide for researchers to select a critical LRS in advance by prediction using our equation. Most significantly, since all the points in a full training curve follow the equation, we can achieve accurate loss prediction at any given step across any learning rate scheduler, while expending less than 1% of the computational cost required by the Chinchilla scaling law to fit language modeling loss. This approach greatly democratizes scaling-law fitting and prediction in developing large language models.

*Note: see the annealing (Section 3.1.3) and scaling-law related parts of the LLaMA-3 report.
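To make the formulation concrete, below is a minimal sketch of how the full loss curve could be predicted from an LR schedule using the form \(L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2\). The schedule, the decay factor `lam`, and the constants `L0`, `A`, `alpha`, `C` are hypothetical placeholders; in the paper these constants are fitted from one or two observed training curves, and $S_2$ is accumulated as a decay-weighted (momentum-like) sum of step-wise LR drops, which the code below approximates.

```python
import numpy as np

def wsd_schedule(total_steps, anneal_steps, eta_max=3e-4, eta_min=3e-5):
    """Constant LR followed by a linear anneal (WSD-style), for illustration."""
    stable = np.full(total_steps - anneal_steps, eta_max)
    anneal = np.linspace(eta_max, eta_min, anneal_steps)
    return np.concatenate([stable, anneal])

def predict_loss(lrs, L0, A, alpha, C, lam=0.999):
    """Predict L(s) = L0 + A * S1^-alpha - C * S2 at every step s.

    S1: forward area, the cumulative sum of the LR over steps.
    S2: LR annealing area, accumulated here as a momentum-like sum of
        step-wise LR drops with decay factor `lam` (a simplified reading
        of the paper's definition).
    """
    S1 = np.cumsum(lrs)
    drops = np.concatenate([[0.0], lrs[:-1] - lrs[1:]])  # eta_{i-1} - eta_i
    m = np.zeros_like(lrs)
    for i in range(1, len(lrs)):
        m[i] = lam * m[i - 1] + drops[i]
    S2 = np.cumsum(m)
    return L0 + A * np.power(S1, -alpha) - C * S2

if __name__ == "__main__":
    lrs = wsd_schedule(total_steps=20_000, anneal_steps=4_000)
    # Hypothetical constants; in practice L0, A, alpha, C are fitted
    # against one or two real training curves.
    loss = predict_loss(lrs, L0=2.0, A=0.6, alpha=0.4, C=1.0)
    print(loss[[999, 9_999, 15_999, 19_999]])  # predicted loss at a few steps
```

Once the constants are fitted to an observed curve (e.g., with a standard least-squares routine), the same function can be evaluated on any other LR schedule, which is what enables step-level loss prediction at a small fraction of the cost of Chinchilla-style fitting.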

