abstract: While large language models (LLMs) have demonstrated exceptional capabilities on challenging tasks such as mathematical reasoning, existing methods for enhancing reasoning ability predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data after pre-training. However, these approaches critically depend on external supervision, such as human-labelled reasoning traces, verified golden answers, or pre-trained reward models, which limits their scalability and practical applicability. In this work, we propose Entropy Minimized Policy Optimization (EMPO), an early attempt at fully unsupervised incentivization of LLM reasoning. EMPO requires no supervised signal to incentivize reasoning capabilities: neither verifiable reasoning traces, nor problems with golden answers, nor additional pre-trained reward models. By continuously minimizing the predictive entropy of the LLM on unlabeled user queries in a latent semantic space, EMPO enables purely self-supervised evolution of reasoning capabilities with strong flexibility and practicality. Our experiments demonstrate competitive performance of EMPO on both mathematical reasoning and free-form commonsense reasoning tasks. Without any supervised signal, EMPO boosts the accuracy of Qwen2.5-Math-7B Base from 30.7% to 48.1% on mathematical benchmarks and improves the truthfulness accuracy of Qwen2.5-7B Instruct from 87.16% to 97.25% on TruthfulQA.
TL;DR
Entropy Minimized Policy Optimization (EMPO) is a fully unsupervised reinforcement-learning method for strengthening the reasoning ability of large language models (LLMs), requiring neither external labels nor a pre-trained reward model.
EMPO minimizes the entropy of the semantic distribution over the model's sampled outputs, steering the model toward consistent and reliable reasoning results (a minimal code sketch follows below).
In experiments, EMPO achieves performance competitive with supervision-based models on both mathematical problems and free-form commonsense questions.
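The sketch below illustrates the core idea under one simple set of assumptions: sampled answers are clustered by a normalized final-answer string (a stand-in for the paper's latent semantic clustering; free-form tasks would instead need a semantic-equivalence check), and each sample is rewarded with the empirical probability of its cluster, so that reinforcing high-reward samples drives the group's semantic entropy down. Function names such as `empo_rewards` and `semantic_clusters` are illustrative and not taken from any released code; the resulting rewards would then feed a standard group-relative policy-gradient update.

```python
import math
from collections import Counter

def semantic_clusters(answers):
    """Assign each sampled answer a semantic-cluster label.

    Assumption: for math-style tasks we cluster by the normalized final
    answer string; free-form tasks would require an equivalence check
    (e.g., bidirectional entailment), which is omitted in this sketch.
    """
    return [a.strip().lower() for a in answers]

def empo_rewards(answers):
    """Compute unsupervised rewards from the empirical semantic distribution.

    Each sampled answer is rewarded with the empirical probability of its
    semantic cluster, so reinforcing high-reward samples pushes the policy
    toward lower semantic entropy (more self-consistent answers).
    """
    labels = semantic_clusters(answers)
    counts = Counter(labels)
    total = len(labels)
    probs = {cluster: n / total for cluster, n in counts.items()}

    # Predictive semantic entropy of the current sample group (for logging).
    entropy = -sum(p * math.log(p) for p in probs.values())

    rewards = [probs[label] for label in labels]
    return rewards, entropy

# Example: 8 sampled answers to one unlabeled math query.
samples = ["42", "42", "41", "42", "42", "forty-two", "42", "42"]
rewards, H = empo_rewards(samples)
print(rewards)  # majority cluster gets reward 0.75, minority clusters 0.125
print(H)        # semantic entropy of this sample group
```

Rewarding by cluster probability rather than optimizing the entropy directly keeps the signal per-sample, which is the form a group-relative policy optimizer expects; note that "forty-two" lands in its own cluster here, showing why a proper semantic-equivalence check matters beyond exact-match normalization.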