MinWoo(Daniel) Park | Tech Blog


  • Related Project: Private
  • Category: Paper Review
  • Date: 2025-03-11

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

  • url: https://arxiv.org/abs/2503.04697
  • pdf: https://arxiv.org/pdf/2503.04697
  • html: https://arxiv.org/html/2503.04697v1
  • abstract: Reasoning language models have shown an uncanny ability to improve performance at test-time by "thinking longer", that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their chain-of-thought reasoning is not controllable, making it impossible to allocate test-time compute to achieve a desired level of performance. We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints. We use LCPO to train L1, a reasoning language model that produces outputs satisfying a length constraint given in its prompt. L1's length control allows for smoothly trading off computational cost and accuracy on a wide range of tasks, and outperforms the state-of-the-art S1 method for length control. Furthermore, we uncover an unexpected short chain-of-thought capability in models trained with LCPO. For instance, our 1.5B L1 model surpasses GPT-4o at equal reasoning lengths. Overall, LCPO enables precise control over reasoning length, allowing for fine-grained allocation of test-time compute and accuracy. We release code and models at this https URL
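The core idea of LCPO is a scalar reward that combines answer correctness with a penalty for deviating from the length target given in the prompt. A minimal sketch of such a reward, assuming a simple linear penalty weighted by a coefficient `alpha` (the function name and default value here are illustrative, not taken from the paper):

```python
def lcpo_reward(is_correct: bool, target_len: int, actual_len: int,
                alpha: float = 0.001) -> float:
    """Illustrative LCPO-style reward: correctness minus a penalty
    proportional to how far the generated chain-of-thought length
    deviates from the user-specified target length (in tokens)."""
    correctness = 1.0 if is_correct else 0.0
    length_penalty = alpha * abs(target_len - actual_len)
    return correctness - length_penalty


# A correct answer that exactly hits the length target gets the full reward;
# over- or under-shooting the target erodes it symmetrically.
print(lcpo_reward(True, 1000, 1000))   # 1.0
print(lcpo_reward(True, 1000, 1500))   # 0.5
print(lcpo_reward(False, 1000, 1200))  # -0.2 (approximately)
```

During RL training, this reward would be maximized over sampled completions (e.g. with a policy-gradient method), pushing the model both toward correct answers and toward respecting the prompted length budget.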
