Agentic RL: Environments and Tools

Agentic RL is not a setting where the model answers once and is done.

observe -> think -> tool call -> observe -> act -> ...

The model interacts with an environment and receives its reward only after many steps.
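The loop above can be sketched with a toy environment. This is only an illustration: `CountdownEnv`, `policy`, and `run_episode` are hypothetical names invented here, and the fixed `policy` stands in for a model that would normally generate the tool call.

```python
class CountdownEnv:
    """Toy environment (hypothetical): the agent must drive a counter to zero."""

    def __init__(self, start: int = 3) -> None:
        self.start = start
        self.value = start

    def reset(self) -> int:
        self.value = self.start
        return self.value

    def step(self, action: str) -> tuple[int, float, bool]:
        if action == "decrement":
            self.value -= 1
        done = self.value <= 0
        # The reward is sparse: it only appears at the terminal state.
        reward = 1.0 if done else 0.0
        return self.value, reward, done


def policy(observation: int) -> str:
    """Stand-in for the model: always emits the same tool call."""
    return "decrement"


def run_episode(env: CountdownEnv, max_steps: int = 10):
    """Run observe -> act -> observe until the episode ends or max_steps is hit."""
    trajectory = []
    total_reward = 0.0
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        total_reward += reward
        obs = next_obs
        if done:
            break
    return trajectory, total_reward
```

With `start=3` the episode takes three steps, and only the last step carries a nonzero reward.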

How it differs from standard RLVR

Math RLVR usually ends with a single prompt and a single final answer.

math prompt -> completion -> answer verifier

Agentic RL has intermediate actions.

task
  -> shell command
  -> file edit
  -> test run
  -> browser action
  -> final result
  -> reward

Here an action is not just generated tokens; it also includes tool calls and interactions with the environment.
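One way to make this broader notion of an action concrete is to give text and tool calls separate types. The sketch below is an assumption for illustration only: `TextAction`, `ToolCall`, `tool_calls`, and the example file paths and commands are all hypothetical, not taken from any library.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class TextAction:
    """Plain generated text (reasoning or a final message)."""
    text: str


@dataclass
class ToolCall:
    """An action that touches the environment (hypothetical schema)."""
    tool: str
    arguments: dict = field(default_factory=dict)


Action = Union[TextAction, ToolCall]


def tool_calls(trajectory: list) -> list:
    """Return only the actions that interact with the environment."""
    return [a for a in trajectory if isinstance(a, ToolCall)]


# A made-up coding episode following task -> shell -> edit -> test -> result.
episode = [
    TextAction("The failing test points at parse_date; inspect it first."),
    ToolCall("shell", {"command": "grep -n parse_date src/utils.py"}),
    ToolCall("file_edit", {"path": "src/utils.py", "patch": "..."}),
    ToolCall("test_run", {"command": "pytest -q"}),
    TextAction("Tests pass; submitting the final result."),
]
```

Separating the two kinds of action makes it easy to log, permission-check, or count the steps that actually have side effects.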

Rewards in agentic RL come in many forms.

unit tests passed
browser task completed
game score improved
compiler accepted code
human or model judge preference
environment terminal state

Because the reward arrives late and episodes span many intermediate steps, rollout cost and variance both increase.
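To see what a late reward means for earlier steps, the sketch below computes discounted returns-to-go from a per-step reward list. The text does not prescribe a credit-assignment method; discounting with a factor `gamma` is an assumption used here purely for illustration, and `discounted_returns` is a hypothetical helper.

```python
def discounted_returns(rewards: list, gamma: float = 0.99) -> list:
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Sparse terminal reward: only the last of three steps is rewarded.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

Every early step is credited only through the single terminal signal, so the longer the episode, the more steps share one noisy outcome.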

Coding agents and browser agents actually execute things.

untrusted generated code
shell commands
network requests
file writes
browser interaction

So sandboxing, timeouts, permissions, hidden tests, and trace logging become part of the training loop.
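As a minimal sketch of the timeout and trace-logging pieces, the function below runs generated Python code in a child process with a wall-clock limit and returns a record that could be logged. It is NOT a real sandbox: it gives no network, filesystem, or resource isolation, which would need containers or a similar mechanism. `run_untrusted` and the record fields are hypothetical names.

```python
import subprocess
import sys
import tempfile


def run_untrusted(code: str, timeout_s: float = 2.0) -> dict:
    """Run code in a child interpreter with a timeout and return a trace record."""
    # A throwaway working directory keeps stray relative-path writes contained.
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising TimeoutExpired.
            return {"status": "timeout", "returncode": None,
                    "stdout": "", "stderr": ""}
    return {
        "status": "ok" if proc.returncode == 0 else "error",
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

Returning a structured record instead of raising lets the training loop treat a timeout or crash as just another outcome to score and log.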

On the TRL side, the Harbor and OpenEnv examples show agentic RL in miniature.

environment_factory
reward_funcs
BrowserGym / Wordle / CARLA examples
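For orientation, a custom reward function passed through `reward_funcs` is, as far as the TRL documentation describes it, a callable that receives the batch of completions plus extra dataset columns as keyword arguments and returns one float per completion. The sketch below follows that shape for a Wordle-like guess. It assumes plain-string completions and a hypothetical dataset column named `target`, and it does not cover `environment_factory`, whose exact signature is not given here.

```python
def exact_guess_reward(completions: list, target: list, **kwargs) -> list:
    """Score each completion 1.0 if its last word equals the target word.

    Assumptions (not from the source): completions are plain strings and
    `target` is a hypothetical dataset column holding the hidden word.
    """
    rewards = []
    for completion, word in zip(completions, target):
        tokens = completion.strip().split()
        guess = tokens[-1].lower() if tokens else ""
        rewards.append(1.0 if guess == word.lower() else 0.0)
    return rewards


print(exact_guess_reward(
    ["I think the word is CRANE", "my guess: slate"],
    target=["crane", "crane"],
))  # [1.0, 0.0]
```

The function only looks at the final text; in a multi-step environment the reward would more likely come from the environment's terminal state.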

On the Slime side, the custom data generation and coding agent RL examples show how large-scale agentic rollouts can be attached to the training/data buffer loop.

custom generate function
sandboxed tool use
test-based rewards
fully async rollout
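The idea of asynchronous rollouts feeding a data buffer can be sketched with `asyncio`. This is a generic illustration, not Slime's API: `rollout_worker`, `trainer`, `run_async_rollouts`, the fake reward, and the sleep that simulates episode length are all hypothetical.

```python
import asyncio


async def rollout_worker(worker_id: int, tasks: asyncio.Queue,
                         buffer: asyncio.Queue) -> None:
    """Pull tasks forever, 'run' an episode, and push the sample to the buffer."""
    while True:
        task = await tasks.get()
        # Simulate episodes of different lengths (placeholder for real rollouts).
        await asyncio.sleep(0.001 * (task % 3))
        await buffer.put({"task": task, "worker": worker_id,
                          "reward": float(task % 2 == 0)})
        tasks.task_done()


async def trainer(buffer: asyncio.Queue, num_samples: int,
                  batch_size: int) -> list:
    """Consume samples in batches as they arrive, without waiting for all rollouts."""
    batches = []
    seen = 0
    while seen < num_samples:
        size = min(batch_size, num_samples - seen)
        batch = [await buffer.get() for _ in range(size)]
        seen += len(batch)
        batches.append(batch)  # a real trainer would run an update step here
    return batches


async def run_async_rollouts(num_tasks: int = 8, num_workers: int = 3,
                             batch_size: int = 4) -> list:
    tasks: asyncio.Queue = asyncio.Queue()
    buffer: asyncio.Queue = asyncio.Queue()
    for task_id in range(num_tasks):
        tasks.put_nowait(task_id)
    workers = [asyncio.create_task(rollout_worker(i, tasks, buffer))
               for i in range(num_workers)]
    batches = await trainer(buffer, num_tasks, batch_size)
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return batches
```

Calling `asyncio.run(run_async_rollouts())` yields two batches of four samples each; the order of samples within a batch depends on which simulated episodes finish first.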