ZeroSearch

https://dsdanielpark.github.io https://github.com/dsdanielpark

MinWoo(Daniel) Park | Tech Blog

Created: 2025-05-10 11:45:54 +0000

Last modified: 2025-05-10 20:56:50 +0900

Related Project: Private
Category: Paper Review
Date: 2025-05-10

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

url: https://arxiv.org/abs/2505.04588
pdf: https://arxiv.org/pdf/2505.04588
html: https://arxiv.org/html/2505.04588v1
abstract: Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs’ search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a reinforcement learning framework that incentivizes the search capabilities of LLMs without interacting with real search engines. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both relevant and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model’s reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.

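1. Problem Definition and Research Motivation

Problem
- A large language model (LLM) has its knowledge “frozen” in the data it saw during pre-training → over time it produces out-of-date information or hallucinations, which lowers its reliability in real work.
- To compensate, the model needs search tool-use ability: retrieving external knowledge in real time.
Limitations of existing approaches
1. RAG (Retrieval-Augmented Generation)
  - Attempts “search → answer” inside the prompt, but the prompt engineering burden is heavy and the quality of search results is uneven.
2. RL + Real Search (e.g., Search-R1, DeepResearcher)
  - RL needs hundreds of thousands of rollouts. Calling the real Google API that many times drives API fees to a prohibitive level, and
  - the quality of the returned documents changes at every step, which increases gradient noise.
The ZeroSearch idea
- Replace the “search engine” with a small LLM to build a simulated (offline) search environment, then
- use curriculum learning to gradually degrade document quality (Easy → Hard) so that the policy LLM learns robust reasoning.
- Because no real internet calls are made during RL training, the search-API cost is 0, and the simulator's document quality can be controlled precisely.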

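2. Optimization Objective (RL Formulation)

\[\max_{\pi_\theta}\; \underbrace{\mathbb{E}_{x\sim\mathcal D,\;y\sim\pi_\theta(\cdot|x;\pi_\psi)}\!\left[r_\varphi(x,y)\right]}_{\text{(a) expected reward}} -\; \beta\, \underbrace{D_{\mathrm{KL}}\!\bigl[\pi_\theta(y|x;\pi_\psi)\;\bigl\Vert\;\pi_{\text{ref}}(y|x;\pi_\psi)\bigr]}_{\text{(b) penalty for diverging from the reference model}}\]

Symbol	Meaning	Plain-language reading
$x$	Question	The problem the model has to answer
$\mathcal D$	Question distribution	Pool of problems used for training and evaluation
$y$	Full model output	Token sequence of `<think>` + `<search>` + `<answer>`
$\pi_\theta$	Policy LLM	The model actually being trained
$\pi_\psi$	Simulation LLM	Plays the “fake Google”; parameters are frozen
$r_\varphi$	Reward function	F1 accuracy of the answer (Section 6 below)
$\pi_{\text{ref}}$	Reference model	Earlier, frozen version of the model that enters the KL penalty and keeps the policy from drifting
$\beta$	KL penalty coefficient	Trade-off between reward and stability

An intuitive analogy

Term (a) is the “medal score the athlete earns by running”: the higher, the better.
Term (b) is the distance from the “safe course drawn by the coach”: stray too far and the penalty grows.
$\beta$ is the tension of the “safety rope”: at 0 the athlete explores freely, and a large value forces the coach's course.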
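
To make the two terms concrete, here is a minimal sketch of a Monte Carlo estimate of the objective from a batch of sampled outputs. It assumes sequence-level log-probabilities under the policy and the reference model are already available; the function name and the simple sample-mean KL estimator are illustrative choices, not something the paper prescribes.

```python
from statistics import mean

def kl_regularized_objective(rewards, logp_policy, logp_ref, beta):
    """Estimate E[r] - beta * KL(pi_theta || pi_ref) from samples y ~ pi_theta.

    rewards[i]      reward r(x, y_i) of the i-th sampled output
    logp_policy[i]  log pi_theta(y_i | x)
    logp_ref[i]     log pi_ref(y_i | x)
    """
    expected_reward = mean(rewards)  # term (a)
    # Sample-mean estimate of the KL divergence: E[log pi_theta - log pi_ref].
    kl_estimate = mean([lp - lr for lp, lr in zip(logp_policy, logp_ref)])  # term (b)
    return expected_reward - beta * kl_estimate

# Hypothetical numbers, for illustration only.
value = kl_regularized_objective(
    rewards=[1.0, 0.0],
    logp_policy=[-1.0, -2.0],
    logp_ref=[-1.5, -2.5],
    beta=0.1,
)
```

With these made-up inputs the mean reward is 0.5 and the KL estimate is 0.5, so raising `beta` shifts weight from the first to the second.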

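3. Prompt Structure for Training (Training Template)

<think> internal reasoning (Chain-of-Thought) </think>
<search> Query_1 </search>
<information>
  Doc 1
  Doc 2
  ...
</information>
<think> reasoning that incorporates the new information </think>
...
<answer> final answer </answer>

Think step: the model records its own reasoning inside the <think> tag → this part need not be shown to the user, but gradients still flow through it during RL.
Search step: when the model judges that it needs help, it issues a query string inside a <search> tag.
Information step: the simulator LLM returns 5 documents (about 30 words each) for the query.
After repeating as many times as needed, the model puts only a short, definitive answer inside the <answer> tag.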
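
The loop described above can be sketched as a small parser that decides what the environment should do with the policy's latest output. This is a minimal sketch: the helper names `next_action` and `wrap_documents` are hypothetical, and the rule "an answer takes priority over a search" is an assumption rather than something stated in the paper.

```python
import re

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def next_action(generated: str):
    """Return ("answer", text), ("search", query) or ("invalid", None)."""
    answer = ANSWER_RE.search(generated)
    if answer:
        return "answer", answer.group(1).strip()
    queries = SEARCH_RE.findall(generated)
    if queries:
        # Use the most recent query if the text contains several.
        return "search", queries[-1].strip()
    return "invalid", None

def wrap_documents(docs):
    """Format simulator documents as the <information> block of the template."""
    body = "\n".join(f"  {doc}" for doc in docs)
    return f"<information>\n{body}\n</information>"
```

A rollout would call `next_action` after each generation step, append `wrap_documents(...)` to the context on a search, and stop on an answer.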

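Why the three-way split?

Debuggable: you can trace where a reasoning error occurred.

Single reward: only the final answer is rewarded, and the format is enforced by rules, which prevents reward hacking.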

4. Building the Search Simulation LLM (Step-by-Step)

4.1 Trajectory Collection

Call the real Google Web Search to generate multi-turn QA sessions, and
label sessions that reach the correct answer as positive and those that do not as negative.
- Positive sessions → “useful” documents
- Negative sessions → “noisy” documents

4.2 Prompt-Level Quality Switch

You are the Google search engine.
Given a query, generate five [useful / noisy] documents...
The user tries to answer: [QUESTION] whose answer is [ANSWER].

Swapping the single word [useful / noisy] is enough to control document quality, and
including QUESTION and ANSWER in the prompt broadens the simulator's knowledge boundary as far as possible.
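
As a minimal sketch, the quality switch amounts to filling one slot in a prompt template. The template text below copies the abbreviated version shown above, including its elision. The trailing `Query:` line and the function name `build_simulation_prompt` are assumptions, since the abbreviated template does not show where the query is placed.

```python
# Abbreviated template from this post; the "..." is kept as-is.
TEMPLATE = (
    "You are the Google search engine.\n"
    "Given a query, generate five {quality} documents...\n"
    "The user tries to answer: {question} whose answer is {answer}.\n"
    "Query: {query}"  # assumption: placement of the query is not shown in the post
)

def build_simulation_prompt(query: str, question: str, answer: str, noisy: bool) -> str:
    """Fill the template, switching document quality with a single word."""
    quality = "noisy" if noisy else "useful"
    return TEMPLATE.format(quality=quality, question=question, answer=answer, query=query)

prompt = build_simulation_prompt(
    query="first US Treasury Secretary",
    question="Who was the first Secretary of the Treasury?",
    answer="Alexander Hamilton",
    noisy=True,
)
```

Only the `noisy` flag changes between an easy and a hard retrieval scenario, which is what lets a curriculum drive the simulator.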

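4.3 Lightweight SFT

Backbone: Qwen-2.5-3B / 7B / 14B
Learning rate: $1\times10^{-6}$ (to prevent over-fitting)
Result: the 7B simulator already matches Google-level quality, and the 14B simulator surpasses Google on some benchmarks.

This is like training a painter to produce both “a sharp picture and a deliberately smudged one”: since a few lines of prompt text switch the style (noise level), the simulator plugs directly into a curriculum.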

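5. Curriculum Rollout Strategy

5.1 Noise Probability Formula

\[p_i = p_s + \frac{b^{\,\tfrac{i}{m}} - 1}{\,b-1\,}\;(p_e - p_s), \quad b=4\]

Symbol	Description
$p_i$	Probability that the simulator emits noisy documents at training step $i$
$p_s$	Start probability (e.g., 0.1)
$p_e$	End probability (e.g., 0.9)
$i$ / $m$	Current step / total steps (progress from 0 to 1)
$b$	Base of the exponential (with 4, the increase starts gently and then steepens)

Early (0% of training): with $p_s = 0.1$, 90% of documents are useful → the model first learns how to call search and the output format.
Late (100% of training): with $p_e = 0.9$, 90% of documents are noisy → the model learns robust reasoning, picking out the key clues amid the noise.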
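
The schedule is easy to check numerically. The sketch below implements the formula directly, using the example values $p_s = 0.1$ and $p_e = 0.9$ from the table as defaults. These are this post's illustrative values, not necessarily the paper's settings, and the function name is hypothetical.

```python
def noise_probability(step, total_steps, p_start=0.1, p_end=0.9, base=4.0):
    """p_i = p_s + (b^(i/m) - 1) / (b - 1) * (p_e - p_s)."""
    progress = step / total_steps                     # i / m, between 0 and 1
    scale = (base ** progress - 1.0) / (base - 1.0)   # 0 at the start, 1 at the end
    return p_start + scale * (p_end - p_start)

# Noise probability at 0%, 25%, 50%, 75% and 100% of training.
schedule = [noise_probability(i, 100) for i in (0, 25, 50, 75, 100)]
```

At the halfway point $b^{1/2} = 2$, so the scale is $1/3$ and the noise probability is about 0.37 rather than the 0.5 a linear schedule would give. Most of the degradation is pushed into the second half of training.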

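6. Reward Function: F1-Based

\[r_\varphi(x,y)= \frac{2 \times \text{IN}}{\text{PN} + \text{RN}}\]

Term	Meaning in the formula	Example
IN (Intersection)	Number of words shared by the prediction and the reference	“Alexander Hamilton” vs. “Hamilton” → IN = 1
PN (Prediction Len)	Number of words in the prediction	2
RN (Reference Len)	Number of words in the reference	1

Precision = IN / PN, Recall = IN / RN, and F1 is their harmonic mean, so the example above scores $2 \times 1 / (2 + 1) \approx 0.67$.
Exact Match (EM) gives 0 if even a single word differs, and using it as the reward encourages the shortcut of listing every possible word in the answer,
whereas F1 acts as a length penalty through precision and discourages verbose answers.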
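
A minimal sketch of this reward is shown below. Lowercasing and whitespace tokenization are assumptions, since this post does not specify the answer normalization, and `f1_reward` is a hypothetical name.

```python
from collections import Counter

def f1_reward(prediction: str, reference: str) -> float:
    """Word-level F1: 2 * IN / (PN + RN)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    return 2 * overlap / (len(pred_tokens) + len(ref_tokens))

score = f1_reward("Alexander Hamilton", "Hamilton")  # IN = 1, PN = 2, RN = 1
```

Padding the prediction with extra guesses increases PN without increasing IN, so the score drops. This is the length penalty described above.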

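7. Applying RL Algorithms and Gradient Masking

Algorithm	Characteristics	Use in ZeroSearch
PPO	Clipped policy updates, stable	Uses a value head separate from πθ
GRPO	Group rollouts, lower variance	Samples 5 responses for the same prompt
Reinforce++	Baseline subtraction	Reported as compatible in the paper's appendix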
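
To illustrate the GRPO row, here is a minimal sketch of group-relative advantages over the 5 responses sampled for one prompt. It assumes the commonly used normalization (reward minus group mean, divided by group standard deviation). The function name, the epsilon, and the choice of the population standard deviation are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses to the same prompt."""
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - group_mean) / (group_std + eps) for r in rewards]

# Hypothetical F1 rewards of 5 responses sampled for one question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.5])
```

Because the baseline comes from the group itself, no separate value model is needed. That is the main practical difference from the PPO row.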

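Gradient Masking

Document tokens (the simulator's output) are external input that the policy did not generate itself, so they are masked out of the loss, i.e. treated as $\nabla_\theta \log \pi_\theta(\text{doc}) = 0$, to prevent training instability.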
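
The masking can be sketched in plain Python as follows. For simplicity the sketch assumes each tag is a single token and that the `<information>` tags themselves are also masked; a real tokenizer would split them into several tokens, and both helper names are hypothetical.

```python
def build_loss_mask(tokens):
    """Return 1 for policy-generated tokens, 0 for tokens inside <information>."""
    mask, inside = [], False
    for tok in tokens:
        if tok == "<information>":
            inside = True
        mask.append(0 if inside else 1)
        if tok == "</information>":
            inside = False
    return mask

def masked_mean(per_token_loss, mask):
    """Average the loss over unmasked tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept

tokens = ["<search>", "q", "</search>", "<information>", "doc", "</information>",
          "<answer>", "a", "</answer>"]
mask = build_loss_mask(tokens)
```

Multiplying the per-token loss by this mask means the document tokens contribute no gradient, however large their individual loss values are.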

8. Experimental Design

Model families
- Qwen-2.5 (3B/7B, Base and Instruct)
- LLaMA-3.2-3B (Base and Instruct)
Datasets
- Train: Natural Questions + HotpotQA (a mix of single-hop and multi-hop)
- Eval: 7 QA benchmarks (NQ, TriviaQA, PopQA, HotpotQA, 2Wiki, Musique, Bamboogle)
Search baseline
- Real Google Search (SerpAPI), top-5 documents
Hyperparameters
- πθ LR = 1e-6, Value LR = 1e-5
- GAE (λ=1, γ=1): minimizes the delayed-reward problem in long reasoning chains
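
To see why λ=1, γ=1 helps with delayed rewards, the sketch below runs standard Generalized Advantage Estimation on a toy episode where only the final step is rewarded. The rewards and value estimates are made-up numbers, and the function is a generic GAE implementation rather than the paper's training code.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation, computed backwards over one episode."""
    advantages = [0.0] * len(rewards)
    next_value, running = 0.0, 0.0  # the value after the final step is 0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# Toy episode: reward arrives only with the final answer.
advantages = gae_advantages(rewards=[0.0, 0.0, 0.0, 1.0],
                            values=[0.2, 0.4, 0.6, 0.8])
```

With λ=γ=1 the TD errors telescope, so each advantage equals the final reward minus the value estimate at that step (0.8, 0.6, 0.4, 0.2 here, up to floating-point rounding). Early reasoning steps therefore receive the answer reward undiscounted.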

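9. Experimental Results and Qualitative Analysis

9.1 Performance Highlights

With Qwen-7B, ZeroSearch reaches an average EM of 40.5 > Search-R1 (Google) at 39.2
With the Qwen-14B simulator, it outperforms Google on some datasets
The “Easy→Hard” curriculum yields 2 to 5 points higher EM than “Hard→Easy” → this validates the curriculum direction

9.2 Training Dynamics

Reward curve
- Gentle at first (learning tool use) → sharp rise mid-training (exploiting document signals) → plateau late (offset by rising difficulty)
Interaction turns
- Early on about 10 searches per question, dropping to 3 to 4 later → unnecessary searches are cut and efficiency improves

9.3 Simulator Scale vs. Quality

Simulation LLM	Relative to Google	Practical comment
Prompt-3B	-12%	Prompt engineering alone is not enough
SFT-3B	0% to -5%	Can start on a small GPU
SFT-7B	≈ Google	A 2-GPU server is recommended
SFT-14B	+1% to +3%	Broader knowledge improves noise identification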

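10. Practical Insights

Compute vs. API cost trade-off
- GPU rental for running the simulator is an added cost, but it is far cheaper than API fees (SerpAPI: $0.005 × 10^5 calls = $500)
Domain customization
- For specialized fields such as medicine or law, SFT on a domain corpus could build a “fake PubMed” or a “fake LexisNexis”
Scalable curriculum
- Noise can be extended from a single probability to typed difficulty levels, such as topic mismatch or temporal errors, enabling multi-dimensional robustness training
Possibility of a hybrid mode
- Pre-train cheaply with ZeroSearch first, then fine-tune on real search for only the last few epochs to secure coverage of up-to-date information.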

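11. Limitations and Future Work

Limitation	Details	Possible research direction
Simulator staleness	News and events after the LLM's cutoff are missing	Periodic incremental SFT, streaming retraining
GPU infrastructure required	With 14B, a single A100 takes ≈ 70 ms per 5 docs	Distill into a fast 3B simulator
Costly high-quality labeling	Collecting positive/negative trajectories still requires real search calls	Use active learning to sample only a minimal set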
