Self-Guided Process Reward Optimization with Masked Step Advantage for Process Reinforcement Learning

https://dsdanielpark.github.io https://github.com/dsdanielpark

MinWoo(Daniel) Park | Tech Blog

Created: 2025-06-30 11:45:54 +0000

Last modified: 2025-06-30 20:56:50 +0900

Related Project: Private
Category: Paper Review
Date: 2025-06-30

Self-Guided Process Reward Optimization

url https://arxiv.org/abs/2507.01551
pdf https://arxiv.org/pdf/2507.01551
abstract Process Reinforcement Learning (PRL) has demonstrated considerable potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, introducing additional process reward models incurs substantial computational overhead, and there is no unified theoretical framework for process-level advantage estimation. To bridge this gap, we propose Self-Guided Process Reward Optimization (SPRO), a novel framework that enables process-aware RL through two key innovations: (1) we first theoretically demonstrate that process rewards can be derived intrinsically from the policy model itself, and (2) we introduce well-defined cumulative process rewards and Masked Step Advantage (MSA), which facilitates rigorous step-wise action advantage estimation within shared-prompt sampling groups. Our experimental results demonstrate that SPRO outperforms vanilla GRPO with 3.4x higher training efficiency and a 17.5% test accuracy improvement. Furthermore, SPRO maintains a stable and elevated policy entropy throughout training while reducing the average response length by approximately 1/3, evidencing sufficient exploration and prevention of reward hacking. Notably, SPRO incurs no additional computational overhead compared to outcome-supervised RL methods such as GRPO, which benefits industrial implementation.
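
The abstract describes step-wise advantages computed from cumulative process rewards within a group of responses sampled for the same prompt. The sketch below is one plausible reading of that idea, not the paper's exact formulation. It assumes the per-token process reward is a scaled log-ratio between the policy and a reference model, and that the baseline at each step is the mean over responses that still have a token at that step. The function name `masked_step_advantage`, the `beta` parameter, and the array layout are all hypothetical.

```python
import numpy as np

def masked_step_advantage(logp_policy, logp_ref, mask, beta=1.0):
    """Illustrative per-step advantages for G responses to one prompt.

    logp_policy, logp_ref: (G, T) per-token log-probabilities, padded to length T.
    mask: (G, T) array, 1 for real tokens and 0 for padding.
    """
    mask = np.asarray(mask, dtype=float)
    # Assumption: the process reward is the scaled policy/reference log-ratio.
    step_reward = beta * (np.asarray(logp_policy) - np.asarray(logp_ref)) * mask
    # Cumulative process reward up to and including each step.
    cum_reward = np.cumsum(step_reward, axis=1)
    # Per-step baseline: mean over responses that are still active at that step.
    active = mask.sum(axis=0)
    baseline = (cum_reward * mask).sum(axis=0) / np.maximum(active, 1.0)
    # Padded positions receive zero advantage.
    return (cum_reward - baseline) * mask
```

The mask keeps shorter responses from dragging down the baseline at steps they never reached, which is the role the word "masked" plays in this sketch.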
