Async Rollout and Weight Sync

Small-scale RL training can alternate between generate and update inside a single process.

generate -> reward -> backward -> optimizer step

In large-scale RL this structure is slow, so rollout serving and training have to be separated.

rollout workers  -> samples -> learner
learner          -> new weights -> rollout workers

Why separate them

Rollout is slow because of long generation and environment interaction. If the training GPUs wait for rollout to finish, their utilization drops.

training GPU idle
  waiting until rollout finishes

rollout GPU idle
  waiting for the weight update

An asynchronous structure overlaps the two tasks, so neither side sits idle while the other works.
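
A minimal sketch of this overlap, using threads and a queue in place of real GPUs. The names `rollout_worker` and `run_demo` are hypothetical, and the version counter only stands in for an optimizer step followed by a weight publish. Each sample carries the policy version that produced it, so the learner can see how stale it is.

```python
import queue
import threading


def rollout_worker(worker_id, sample_queue, get_version, num_samples):
    """Hypothetical rollout worker: tags every sample with the policy version it used."""
    for i in range(num_samples):
        version = get_version()  # version currently visible to this worker
        sample_queue.put({"worker": worker_id, "index": i, "policy_version": version})


def run_demo(num_workers=2, samples_per_worker=4, batch_size=2):
    # Assumes the total sample count is divisible by batch_size,
    # otherwise the last queue.get() would block forever.
    sample_queue = queue.Queue()
    state = {"version": 0}
    lock = threading.Lock()

    def get_version():
        with lock:
            return state["version"]

    workers = [
        threading.Thread(
            target=rollout_worker,
            args=(w, sample_queue, get_version, samples_per_worker),
        )
        for w in range(num_workers)
    ]
    for t in workers:
        t.start()

    total = num_workers * samples_per_worker
    consumed = []
    while len(consumed) < total:
        # The learner consumes a batch while workers keep producing.
        batch = [sample_queue.get() for _ in range(batch_size)]
        consumed.extend(batch)
        with lock:
            state["version"] += 1  # stand-in for optimizer step + weight publish

    for t in workers:
        t.join()
    return consumed, state["version"]


if __name__ == "__main__":
    samples, final_version = run_demo()
    stale = [s for s in samples if s["policy_version"] < final_version]
    print(f"final version {final_version}, {len(stale)} samples from older versions")
```

Because the workers never block on the learner, some samples will have been generated under an older version than the one being updated.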

A rollout worker may have generated a sample with policy v10 while the learner has already advanced to v12.

sample generated by policy v10
update applied to policy v12

This gap is policy mismatch. It is where mechanisms from PPO/GRPO such as old logprobs, the importance ratio, and truncated importance sampling become important.
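
As a rough illustration, one common form of truncated importance sampling takes the ratio between the probability of a token under the learner's current policy and the probability recorded when the sample was generated, then caps it. The function name `truncated_importance_weights` and the cap value of 2.0 are assumptions for this sketch, not values from any specific framework.

```python
import math


def truncated_importance_weights(new_logprobs, old_logprobs, cap=2.0):
    """Per-token importance weights min(exp(new - old), cap).

    new_logprobs: log-probabilities under the policy being updated.
    old_logprobs: log-probabilities stored at generation time (behavior policy).
    cap: hypothetical upper bound that limits the variance from stale samples.
    """
    weights = []
    for new_lp, old_lp in zip(new_logprobs, old_logprobs):
        ratio = math.exp(new_lp - old_lp)  # importance ratio pi_new / pi_old
        weights.append(min(ratio, cap))    # truncate from above only
    return weights


if __name__ == "__main__":
    new_lp = [-1.0, -0.5, -2.0]
    old_lp = [-1.0, -2.0, -1.0]
    print(truncated_importance_weights(new_lp, old_lp))
```

The larger the version gap between generation and update, the further the ratios tend to drift from 1, which is why the old logprobs need to be stored with each sample.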

The rollout engine should generate with the latest policy, so the training side has to send weights to the rollout side. Common mechanisms include:

full checkpoint sync
delta weight sync
disk-based update
HTTP update endpoint
shared filesystem

The larger the model, the more weight sync itself becomes the bottleneck. This is why Slime documents external rollout engines and delta weight sync.
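
To make the idea of a delta sync concrete, here is a toy sketch that sends only the parameters that changed between two versions instead of the full checkpoint. This is an illustrative assumption about what a delta can look like, not Slime's actual protocol. `compute_delta` and `apply_delta` are hypothetical helpers, and weights are plain Python lists rather than tensors.

```python
def compute_delta(old_weights, new_weights):
    """Return only the parameters whose values differ from the old version."""
    return {
        name: values
        for name, values in new_weights.items()
        if old_weights.get(name) != values
    }


def apply_delta(worker_weights, delta, new_version):
    """Apply a delta on the rollout side and return the updated weights and version."""
    updated = dict(worker_weights)  # leave the previous version untouched
    updated.update(delta)
    return updated, new_version


if __name__ == "__main__":
    v10 = {"embed": [0.1, 0.2], "layer0": [1.0, 2.0], "head": [3.0]}
    v11 = {"embed": [0.1, 0.2], "layer0": [1.0, 2.5], "head": [3.1]}

    delta = compute_delta(v10, v11)
    worker, version = apply_delta(v10, delta, new_version=11)
    print(sorted(delta), version, worker == v11)
```

In this toy case the unchanged `embed` entry is never transferred. Whether a real delta is meaningfully smaller than a full sync depends on how many parameters each update touches.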

Agentic rollouts vary widely in length.

sample A: 10 seconds
sample B: 8 minutes
sample C: timeout

If the learner waits synchronously, the slowest sample blocks the whole batch. That is why fully async rollout, partial completion management, and a replay/debug path become important.
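
The following sketch shows one way to avoid blocking on the slowest sample, using `asyncio` with a per-episode timeout. The `run_episode` coroutine is a hypothetical stand-in for a real agentic rollout, and the durations are scaled down to fractions of a second. Results are consumed as they finish, and a timed-out episode is recorded rather than awaited indefinitely.

```python
import asyncio


async def run_episode(name, duration):
    """Hypothetical episode: sleeps to simulate generation plus environment steps."""
    await asyncio.sleep(duration)
    return {"sample": name, "status": "done"}


async def collect(episodes, timeout):
    """Run all episodes concurrently and yield results in completion order."""

    async def guarded(name, duration):
        try:
            return await asyncio.wait_for(run_episode(name, duration), timeout)
        except asyncio.TimeoutError:
            # Record the timeout so it can be inspected or replayed later.
            return {"sample": name, "status": "timeout"}

    tasks = [asyncio.create_task(guarded(n, d)) for n, d in episodes]
    results = []
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)  # fast samples arrive first
    return results


def demo():
    episodes = [("A", 0.01), ("B", 0.05), ("C", 1.0)]
    return asyncio.run(collect(episodes, timeout=0.2))


if __name__ == "__main__":
    for result in demo():
        print(result)
```

A real system would forward each finished sample to the learner immediately instead of collecting a list, and would keep enough metadata on timed-out episodes to support the replay/debug path.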