Exploring Data Scaling Trends

MinWoo(Daniel) Park | Tech Blog

  • Related Project: Private
  • Category: Paper Review
  • Date: 2025-03-31

Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback

  • url: https://arxiv.org/abs/2503.22230
  • pdf: https://arxiv.org/pdf/2503.22230
  • html: https://arxiv.org/html/2503.22230v1
  • abstract: Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses this gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness. Additionally, we find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate our methods’ effectiveness and scalability. Results show that RTV is most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable rapid capture of subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods to overcome performance barriers in RLHF.
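
The hybrid reward idea in the abstract can be illustrated with a minimal sketch. It assumes that prompts with a checkable answer (math, coding) are scored by a verifier and everything else falls back to a generative reward model. All names here (`Prompt`, `VERIFIABLE_TASKS`, `exact_match_verifier`, `toy_genrm`, `hybrid_reward`) are hypothetical and do not come from the paper; the exact-match check and the constant-score GenRM are stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Prompt:
    text: str
    task: str                        # hypothetical task label, e.g. "math", "code", "chat"
    reference: Optional[str] = None  # ground-truth answer, when one exists


# Assumption: these task labels are the ones routed to a verifier.
VERIFIABLE_TASKS = {"math", "code"}


def exact_match_verifier(response: str, reference: str) -> float:
    """Stand-in for a reasoning task verifier: 1.0 on exact match, else 0.0."""
    return 1.0 if response.strip() == reference.strip() else 0.0


def toy_genrm(prompt: Prompt, response: str) -> float:
    """Placeholder for a generative reward model; returns a fixed score."""
    return 0.5


def hybrid_reward(
    prompt: Prompt,
    response: str,
    genrm: Callable[[Prompt, str], float] = toy_genrm,
) -> float:
    """Route verifiable prompts to the verifier, all others to the GenRM."""
    if prompt.task in VERIFIABLE_TASKS and prompt.reference is not None:
        return exact_match_verifier(response, prompt.reference)
    return genrm(prompt, response)


math_prompt = Prompt(text="What is 2 + 2?", task="math", reference="4")
chat_prompt = Prompt(text="Write a short greeting.", task="chat")

print(hybrid_reward(math_prompt, "4"))       # verifier path
print(hybrid_reward(chat_prompt, "Hello!"))  # GenRM path
```

Routing on the presence of a reference keeps the verifier path deterministic, which is one plausible reading of why the abstract ranks RTV as most resistant to reward hacking.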