abstract: In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised fine-tuning with over 1 million samples, as well as multi-stage reinforcement learning. Post-training techniques improve alignment with human preferences and notably enhance long-text generation, structured data analysis, and instruction following. To handle diverse and varied use cases effectively, we present the Qwen2.5 LLM series in a rich range of sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants, Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and performs competitively with the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o, respectively. Additionally, as foundations, the Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and vision-language models.
abstract: In this report, we introduce Qwen2.5-1M, a series of models that extend the context
length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M
series has significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive
pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance
long-context performance while reducing training costs.
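The progressive pre-training mentioned above extends the context window in stages rather than jumping directly to the full length. The staged schedule can be sketched as follows; the specific stage lengths, the RoPE-base values, and the `run_schedule`/`train_stage` names are illustrative assumptions for this sketch, not the report's actual recipe.

```python
# Hedged sketch of progressive long-context pre-training: the model is
# trained in stages, with the context window (and, typically, the RoPE base
# frequency) increased at each stage instead of training at the final
# length from the start. All numeric values here are illustrative.

stages = [
    {"context_len": 4096,    "rope_base": 10_000},
    {"context_len": 32_768,  "rope_base": 1_000_000},
    {"context_len": 262_144, "rope_base": 10_000_000},
]

def run_schedule(stages, train_stage):
    """Apply each stage in order; `train_stage` stands in for a training run."""
    for stage in stages:
        train_stage(stage["context_len"], stage["rope_base"])

# Record the schedule instead of actually training anything.
lengths = []
run_schedule(stages, lambda n, base: lengths.append(n))
print(lengths)  # [4096, 32768, 262144]
```

Training on shorter contexts first keeps most of the compute at cheap sequence lengths, which is how a staged schedule reduces cost relative to training at the target length throughout.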
To promote the use of long-context models among a broader user base, we present and
open-source our inference framework. This framework includes a length extrapolation
method that can expand model context lengths by at least four times
without additional training. To reduce inference costs, we implement a sparse attention
method along with chunked prefill optimization for deployment scenarios and a sparsity
refinement method to improve precision. Additionally, we detail our optimizations in
the inference engine, including kernel optimization, pipeline parallelism, and scheduling
optimization, which significantly enhance overall inference performance. By leveraging
our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill
speedup in scenarios with 1 million tokens of context. This framework provides an
efficient and powerful solution for developing applications that require long-context
processing using open-source models.
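Of the optimizations above, chunked prefill is the simplest to illustrate: instead of running attention over the entire million-token prompt in one pass, the engine processes the prompt in fixed-size chunks, each attending to the key/value cache accumulated so far. The sketch below shows only this chunking control flow; the function name and the cache representation are assumptions, and the per-chunk "forward pass" is a placeholder, not the Qwen2.5-1M kernel.

```python
# Minimal sketch of chunked prefill: a long prompt is split into fixed-size
# chunks that are processed sequentially, each one appending its keys/values
# to a shared cache rather than materializing attention over the full
# prompt at once. This bounds peak activation memory by the chunk size.

def chunked_prefill(tokens, chunk_size, kv_cache=None):
    """Process `tokens` chunk by chunk, growing the cache as we go."""
    kv_cache = [] if kv_cache is None else kv_cache
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # In a real engine this is one forward pass over `chunk`,
        # attending to everything already stored in `kv_cache`.
        kv_cache.extend(chunk)  # placeholder for storing per-token K/V
    return kv_cache

prompt = list(range(10))  # toy "token" sequence
cache = chunked_prefill(prompt, chunk_size=4)
print(len(cache))  # 10: every prompt token ends up in the cache
```

The design trade-off is that smaller chunks lower memory pressure but add per-chunk launch overhead, which is why engines pair chunking with the kernel and scheduling optimizations described above.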
The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo.
Evaluations show that the Qwen2.5-1M models deliver substantial improvements on long-context
tasks without compromising performance in short-context scenarios. Specifically, the
Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context
tasks and supports contexts eight times longer.