Code Model | Arctic-SnowCoder

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-09-03

Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining

  • url: https://arxiv.org/abs/2409.02326
  • pdf: https://arxiv.org/pdf/2409.02326
  • html: https://arxiv.org/html/2409.02326v1
  • abstract: Recent studies have been increasingly demonstrating that high-quality data is crucial for effective pretraining of language models. However, the precise definition of “high-quality” remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using phase two data as seeds, adapting the Magicoder approach for pretraining. Despite being trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art performance on BigCodeBench, a coding benchmark focusing on practical and challenging programming tasks, compared to similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T tokens. Additionally, it matches the performance of leading small base code models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a benchmark that evaluates function-level code generation, and remains competitive on BigCodeBench. Our evaluation presents a comprehensive analysis justifying various design choices for Arctic-SnowCoder. Most importantly, we find that the key to high-quality data is its alignment with the distribution of downstream applications.
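The most distinctive design choice described in the abstract is the phase-two data selection: a BERT-style quality annotator, trained to tell good code apart from random pretraining code, ranks phase-one files, and only the top-scoring files (about 50B tokens) are kept for continued pretraining. The sketch below illustrates that idea only; the model name (`bert-base-uncased`), the token budget constant, and the file format are assumptions for illustration, not the paper's released pipeline, and the classifier head would need to be fine-tuned on good-vs-random code pairs before its scores mean anything.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumption: a generic BERT encoder stands in for the paper's quality
# annotator. As loaded here, the 2-class head is randomly initialized;
# in the paper it is trained with high-quality code files (plus Magicoder
# and StarCoder2-Instruct data) as positives and random code as negatives.
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()


def quality_score(code: str) -> float:
    """Probability that a code file resembles the 'good code' positives."""
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()


def select_high_quality(files, token_budget=50_000_000_000):
    """Rank phase-one files by annotator score and keep the top-scoring ones
    until a token budget (~50B tokens in the paper's phase two) is filled.
    `files` is assumed to be a list of dicts with a "text" field."""
    ranked = sorted(files, key=lambda f: quality_score(f["text"]), reverse=True)
    selected, total = [], 0
    for f in ranked:
        n_tokens = len(tokenizer(f["text"])["input_ids"])
        if total + n_tokens > token_budget:
            break
        selected.append(f)
        total += n_tokens
    return selected
```

Selecting data with a learned classifier rather than fixed heuristics is what lets the retained tokens track the distribution of downstream coding tasks, which the abstract identifies as the key property of "high-quality" data.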
