
Scaling Law with LR Annealing

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-08-20

Scaling Law with Learning Rate Annealing

  • url: https://arxiv.org/abs/2408.11029
  • pdf: https://arxiv.org/pdf/2408.11029
  • html: https://arxiv.org/html/2408.11029v1
  • abstract: We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps $s$: \(L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2\), where $S_1$ is the forward area and $S_2$ is the LR annealing area. This formulation takes into account two factors: (1) the forward scaling defined by the typical scaling law, and (2) the additional loss drop brought by LR annealing. Therefore, this formulation can describe the full loss curve at each step, rather than the single loss point at the end of training. Applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss of language model training at any given step and across any learning rate scheduler (LRS). Furthermore, this equation accurately describes the dynamics during the training process, and provides a theoretical verification and explanation for numerous experimental findings of previous studies, particularly those focusing on LR schedules and LR annealing. The resulting insights also serve as a guide for researchers to select a critical LRS in advance by prediction using our equation. Most significantly, since all the points in a full training curve follow the equation, we can achieve accurate loss prediction at any given step across any learning rate scheduler, while expending less than 1% of the computational cost required by the Chinchilla scaling law to fit language modeling loss. This approach greatly democratizes scaling-law fitting and prediction in developing large language models.

*Note: see the annealing (Section 3.1.3) and scaling-law related parts of the LLaMA-3 report.
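To make the formulation concrete, below is a minimal sketch of how the full loss curve could be predicted from an LR schedule using the form \(L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2\). The schedule, the decay factor `lam`, and the constants `L0`, `A`, `alpha`, `C` are hypothetical placeholders; in the paper these constants are fitted from one or two observed training curves, and $S_2$ is accumulated as a decay-weighted (momentum-like) sum of step-wise LR drops, which the code below approximates.

```python
import numpy as np

def wsd_schedule(total_steps, anneal_steps, eta_max=3e-4, eta_min=3e-5):
    """Constant LR followed by a linear anneal (WSD-style), for illustration."""
    stable = np.full(total_steps - anneal_steps, eta_max)
    anneal = np.linspace(eta_max, eta_min, anneal_steps)
    return np.concatenate([stable, anneal])

def predict_loss(lrs, L0, A, alpha, C, lam=0.999):
    """Predict L(s) = L0 + A * S1^-alpha - C * S2 at every step s.

    S1: forward area, the cumulative sum of the LR over steps.
    S2: LR annealing area, accumulated here as a momentum-like sum of
        step-wise LR drops with decay factor `lam` (a simplified reading
        of the paper's definition).
    """
    S1 = np.cumsum(lrs)
    drops = np.concatenate([[0.0], lrs[:-1] - lrs[1:]])  # eta_{i-1} - eta_i
    m = np.zeros_like(lrs)
    for i in range(1, len(lrs)):
        m[i] = lam * m[i - 1] + drops[i]
    S2 = np.cumsum(m)
    return L0 + A * np.power(S1, -alpha) - C * S2

if __name__ == "__main__":
    lrs = wsd_schedule(total_steps=20_000, anneal_steps=4_000)
    # Hypothetical constants; in practice L0, A, alpha, C are fitted
    # against one or two real training curves.
    loss = predict_loss(lrs, L0=2.0, A=0.6, alpha=0.4, C=1.0)
    print(loss[[999, 9_999, 15_999, 19_999]])  # predicted loss at a few steps
```

Once the constants are fitted to an observed curve (e.g., with a standard least-squares routine), the same function can be evaluated on any other LR schedule, which is what enables step-level loss prediction at a small fraction of the cost of Chinchilla-style fitting.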

