LLM RL Post-Training

This path treats LLM RL as a post-training system rather than a list of algorithm names.

The core question is this:

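If the model generates its own answers,
a reward model, verifier, or environment evaluates those answers,
and that signal is used to update the policy again,
what does the system come to need?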

The path starts by placing pretraining, SFT, preference tuning, RLHF, RLVR, DPO, and distillation relative to one another. It then translates LLM generation into state, action, and reward, and looks at why rollouts and on-policy data sit at the center of RL training.
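The state/action/reward translation mentioned above can be made concrete with a toy rollout. This is only a minimal sketch under stated assumptions: `toy_policy`, `verifier`, `VOCAB`, and the `Step` record are hypothetical stand-ins invented for illustration, not part of any library. The state is the prompt plus the tokens generated so far, the action is the next token, and a single reward arrives at the end of the completion.

```python
import random
from dataclasses import dataclass

# Hypothetical toy vocabulary; a real LLM has tens of thousands of tokens.
VOCAB = ["1", "2", "3", "4", "<eos>"]


@dataclass
class Step:
    state: str     # prompt plus every token generated so far
    action: str    # next token chosen by the policy
    reward: float  # stays 0.0 except on the final step in this sketch


def toy_policy(state: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM: samples the next token uniformly."""
    return rng.choice(VOCAB)


def verifier(prompt: str, completion: str) -> float:
    """Hypothetical verifier: 1.0 if the completion is exactly "4", else 0.0."""
    return 1.0 if completion == "4" else 0.0


def rollout(prompt: str, rng: random.Random, max_tokens: int = 3) -> list[Step]:
    """Generate one completion with the current policy and record each step."""
    steps: list[Step] = []
    completion = ""
    for _ in range(max_tokens):
        state = prompt + completion
        action = toy_policy(state, rng)
        steps.append(Step(state=state, action=action, reward=0.0))
        if action == "<eos>":
            break
        completion += action
    # Sequence-level reward: attached to the last step only.
    steps[-1].reward = verifier(prompt, completion)
    return steps


if __name__ == "__main__":
    for step in rollout("2+2=", random.Random(0)):
        print(step)
```

Because the trajectory is sampled from the policy being trained, the recorded steps are on-policy data: once the policy is updated, they describe an older policy and become progressively stale.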

The middle part covers reward models, RLHF, PPO, DPO, RLVR, and GRPO in turn. The later part looks at reward hacking, KL control, and training metrics, then compares which layer of the problem TRL and Slime each address.

The final part extends to agentic RL, async rollout, and weight sync, and summarizes how small experiments differ from large-scale RL infrastructure.

LLM Post-Training Map - where SFT, preference tuning, RLHF, RLVR, DPO, and distillation sit after pretraining, viewed as a single map.
RL for LLMs: State, Action, Reward - maps the state, action, and reward of general RL onto LLM token generation.
Rollout and On-Policy Data - why rollouts generated by the current policy itself become the training data in LLM RL.
Reward Models and Preference Data - how human preference pairs in RLHF are turned into a scalar reward model.
RLHF Loop - how the SFT policy, reward model, reference model, KL penalty, and policy update form a single RLHF loop.
PPO for LLMs - how PPO stabilizes LLM RL updates with the policy ratio, clipping, advantages, and a value model.
DPO Without Online Rollouts - how DPO updates the policy directly from preference pairs, without a reward model or an online PPO loop.
RLVR: Verifiable Rewards - the RLVR setup, where answer checkers, tests, and execution environments provide the reward instead of a reward model.
GRPO: Group-Relative Policy Optimization - how GRPO samples several completions for the same prompt and builds group-relative advantages from them.
Reward Hacking and KL Control - why real quality can degrade even while reward rises, and what role KL and reference-model control play.
LLM RL Training Metrics - why KL, entropy, response length, invalid-format rate, and held-out performance need to be tracked alongside reward_avg rather than reward_avg alone.
TRL Trainer Workflow - how TRL connects SFT, reward modeling, DPO, and GRPO through its trainer abstraction.
Slime: RL Scaling Infrastructure - how Slime combines Megatron training, SGLang rollout, a Data Buffer, and weight sync to handle large-scale RL post-training.
Agentic RL: Environments and Tools - how LLM RL changes once tool use, browsers, coding sandboxes, and multi-turn environments enter the picture.
Async Rollout and Weight Sync - the staleness and weight sync problems that arise when rollout generation and training updates are decoupled in large-scale LLM RL.
RLHF vs DPO vs GRPO Selection Map - compares RLHF/PPO, DPO, and GRPO/RLVR by goal, data, reward source, and compute budget to help decide which to choose.
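
As a preview of the GRPO entry in the list above, the group-relative advantage idea can be sketched in a few lines. This is a minimal illustration under assumptions, not the implementation of any particular library: the function name, the `eps` value, and the use of the population standard deviation are choices made here for the example. Each completion sampled for the same prompt is scored, and its advantage is its reward standardized against the other completions in its group.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of completions for a single prompt.

    eps is an assumed small constant that avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions for one prompt, scored by a hypothetical 0/1 verifier.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every completion gets the same reward.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Completions that beat the group average get a positive advantage and the rest get a negative one, so no separate value model is needed to form a baseline. When all rewards in a group are identical, every advantage is zero and that prompt contributes no learning signal for the step.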