On-Policy Distillation Loop

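The key point of the previous card was this.

offline distillation:
  The student learns from trajectories the teacher produced.

on-policy distillation:
  The student receives teacher feedback on trajectories it actually produced itself.

This card breaks on-policy distillation down the way a real training loop would run.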

Sample prompts

s ~ D

Draw tasks from the training prompt distribution.

Student rollout

a ~ pi_theta(. | s)

The current student generates the completion.

Teacher scores states

pi_T(. | s, a_<t)

Evaluate teacher probabilities on student prefixes.

Token feedback

log pi_T(a_t) - log pi_theta(a_t)

Convert teacher-student gap into KL or advantage signal.

Update student

policy / KD update

Increase teacher-approved tokens and reduce weak ones.

Dense OPD signal A_t = log pi_T(a_t | s_t) - log pi_theta(a_t | s_t)

A sampled token gets a positive signal when the teacher assigns it more probability than the student does, and a negative signal when the teacher assigns it less.

Freshness

Are rollouts from the current student?

Teacher access

Can we get logprobs or reliable scores?

Stability

Are KL, clipping, or reference controls limiting drift?

OPD alternates generation and learning. The student creates the states, the teacher judges those states, and the update teaches the student how to behave on its own trajectory distribution.

1. Sample prompts

The first step looks like any ordinary post-training loop.

s ~ training prompt distribution

The prompt distribution can cover math problems, coding problems, general instructions, agent tasks, and so on. What matters is less the prompt itself than the trajectory the current student produces for it.

In offline distillation, the prompt goes to the teacher to produce a target response. In on-policy distillation, the prompt goes to the student.

2. The student produces a rollout

Call the current student policy pi_theta. Given prompt s, the student samples a completion a.

a = (a_1, ..., a_L)
a ~ pi_theta(. | s)

This rollout directly reflects the student's current strengths and weaknesses.

On prompts it handles well, it produces good prefixes.
On prompts that confuse it, it drifts into strange prefixes.
In long answers, a mid-sequence mistake leads to new states.

This is exactly what on-policy distillation targets. The teacher does not write a fresh, perfect answer; it gives feedback on the prefixes the student actually visited.
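The rollout step can be sketched with a toy student. This is only an illustration: the three-token vocabulary, the `STUDENT` table (which conditions on the last token only), and the `student_rollout` helper are assumptions made for this example, not part of any real model or library.

```python
import random

# Hypothetical toy student: next-token probabilities depend only on the last token.
VOCAB = ["a", "b", "<eos>"]
STUDENT = {
    "<bos>": [0.6, 0.3, 0.1],
    "a": [0.2, 0.5, 0.3],
    "b": [0.5, 0.1, 0.4],
}

def student_rollout(prompt, rng, max_len=8):
    """Sample a ~ pi_theta(. | s) and record every visited state s_t = (s, a_<t)."""
    completion, states = [], []
    last = "<bos>"
    for _ in range(max_len):
        # The state is the prompt plus whatever the student has emitted so far.
        states.append((prompt, tuple(completion)))
        token = rng.choices(VOCAB, weights=STUDENT[last])[0]
        completion.append(token)
        if token == "<eos>":
            break
        last = token
    return completion, states

rng = random.Random(0)
completion, states = student_rollout("toy prompt", rng)
```

The `states` list is what the teacher will later be asked to score: every prefix in it was produced by the student, including any prefix that follows a poor token choice.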

3. The teacher scores visited states

The state at each token position can be written as follows.

s_t = (s, a_<t)

At this state, the teacher provides a next-token distribution or logprob.

pi_T(. | s_t)

We can then compare how the teacher and the student each rate the token a_t that the student actually sampled.

teacher logprob: log pi_T(a_t | s_t)
student logprob: log pi_theta(a_t | s_t)

If the teacher finds a_t more plausible than the student does, that token is worth reinforcing. If the teacher rates it much lower, the token should probably be suppressed.
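As a minimal sketch of this scoring step, the snippet below evaluates both models on the same student-sampled tokens. The `TEACHER` and `STUDENT` tables and the `token_logprobs` helper are hypothetical; as a simplification, the state s_t is summarized by the last token only, which a real language model would not do.

```python
import math

VOCAB = ["a", "b", "<eos>"]
# Hypothetical toy tables: rows are indexed by the previous token.
STUDENT = {
    "<bos>": [0.6, 0.3, 0.1],
    "a": [0.2, 0.5, 0.3],
    "b": [0.5, 0.1, 0.4],
}
TEACHER = {
    "<bos>": [0.8, 0.1, 0.1],
    "a": [0.1, 0.7, 0.2],
    "b": [0.3, 0.1, 0.6],
}

def token_logprobs(table, completion):
    """Return log pi(a_t | s_t) for each token of a student-sampled completion."""
    logps, last = [], "<bos>"
    for token in completion:
        probs = table[last]
        logps.append(math.log(probs[VOCAB.index(token)]))
        last = token  # the next state extends the student's own prefix
    return logps

completion = ["b", "a", "<eos>"]  # pretend the student sampled this
teacher_lp = token_logprobs(TEACHER, completion)
student_lp = token_logprobs(STUDENT, completion)
```

Both calls walk the same prefixes. The teacher never generates anything here; it only assigns probabilities to tokens the student chose.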

4. Build token-level feedback

The local RLHF reference describes OPD's dense feedback as an advantage-like signal:

A_t = log pi_T(a_t | s_t) - log pi_theta(a_t | s_t)

The intuition is simple.

A_t > 0:
  The teacher rates this token higher than the student does.
  The student can raise this token's probability.

A_t < 0:
  The teacher rates this token lower than the student does.
  The student can lower this token's probability.

This differs from sparse reward. In RLVR, the signal centers on a sequence-level reward, such as whether the whole answer was right or wrong. In OPD, the teacher distribution provides much denser, token-level feedback.
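A short sketch of the signal itself, assuming the per-token logprobs have already been collected. The helper name `opd_advantages` and the numbers are made up for illustration.

```python
def opd_advantages(teacher_logprobs, student_logprobs):
    """A_t = log pi_T(a_t | s_t) - log pi_theta(a_t | s_t), one value per sampled token."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student logprobs must cover the same tokens")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]

# Illustrative values for a three-token completion.
teacher_lp = [-0.2, -2.5, -0.4]
student_lp = [-0.7, -0.9, -0.5]
advantages = opd_advantages(teacher_lp, student_lp)
# Roughly [0.5, -1.6, 0.1]: the second token is the one the teacher disagrees with.

# A sparse, sequence-level reward would give one number for the whole answer instead.
sequence_reward = 0.0  # hypothetical pass/fail outcome
```

The dense signal points at the second token specifically, while the single sequence-level number says nothing about where the answer went wrong.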

5. Update the student

The update depends on the implementation.

reverse KL loss:
  D_KL(pi_theta(. | s_t) || pi_T(. | s_t))

policy-gradient style:
  Multiply the sampled token's logprob by a teacher-derived advantage.

hybrid:
  Mix the OPD signal into a GRPO/PPO-style objective as a reward or advantage.
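The reverse KL option and the sampled signal A_t are connected by a short identity. For a fixed state s_t, taking the expectation of A_t over tokens drawn from the student gives:

```latex
\begin{aligned}
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[A_t\right]
&= \sum_{a} \pi_\theta(a \mid s_t)\,\bigl[\log \pi_T(a \mid s_t) - \log \pi_\theta(a \mid s_t)\bigr] \\
&= -\,D_{\mathrm{KL}}\bigl(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_T(\cdot \mid s_t)\bigr)
\end{aligned}
```

So, at a given state, raising the expected A_t on student-sampled tokens is the same as lowering the reverse KL. This holds per state; it does not account for how an update changes which states the student visits.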

What matters is that the student is updated near its own rollout distribution. This is why on-policy distillation can address the student's actual failure modes.

Where the student went,
the teacher evaluates,
and the student is corrected on the spot.
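A minimal sketch of one possible update, restricted to a single state with a three-token vocabulary. It is an assumption-laden toy: the student is a bare softmax over logits, the teacher distribution is fixed, and the step uses the exact expectation over the vocabulary rather than a sampled token. The helper names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs):
    """D_KL(pi_theta || pi_T) at one state."""
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(student_probs, teacher_probs))

def opd_step(logits, teacher_probs, lr=0.5):
    """One gradient step on the student's logits that lowers the reverse KL."""
    probs = softmax(logits)
    # Per-token signal A = log pi_T - log pi_theta.
    adv = [math.log(q) - math.log(p) for p, q in zip(probs, teacher_probs)]
    # Baseline = expected A under the student (the negative reverse KL).
    baseline = sum(p * a for p, a in zip(probs, adv))
    # Exact policy-gradient direction for a softmax policy.
    return [z + lr * p * (a - baseline) for z, p, a in zip(logits, probs, adv)]

teacher_probs = [0.7, 0.2, 0.1]   # assumed teacher distribution at this state
logits = [0.0, 0.0, 0.0]          # student starts uniform
kl_before = reverse_kl(softmax(logits), teacher_probs)
for _ in range(200):
    logits = opd_step(logits, teacher_probs)
kl_after = reverse_kl(softmax(logits), teacher_probs)
```

Tokens with above-average A_t gain probability and the rest lose it, which is the "increase teacher-approved tokens" behavior in miniature. A real implementation would estimate this from sampled tokens across many student-visited states and backpropagate through the model.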

Implementation pitfalls

On-policy distillation is conceptually clean but heavy to implement.

rollout freshness:
  Updating on completions produced by a stale student makes the update off-policy.

teacher throughput:
  Computing teacher logprobs can become the bottleneck of the training loop.

tokenizer alignment:
  Per-token KL needs comparable token spaces, so a teacher and student with different tokenizers can be hard to compare.

stability control:
  KL control, clipping, a reference model, or reward normalization may be needed.
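One simple way to guard rollout freshness is to tag each rollout with the student version that produced it and drop the stale ones. The sketch below assumes such a tag exists; the `Rollout` record, the `policy_version` field, and the `max_lag` threshold are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    prompt: str
    completion: list
    policy_version: int  # student update step at which this rollout was sampled

def fresh_rollouts(rollouts, current_version, max_lag=0):
    """Keep rollouts sampled at most `max_lag` updates before the current student."""
    return [r for r in rollouts if current_version - r.policy_version <= max_lag]

buffer = [
    Rollout("p1", ["x", "<eos>"], policy_version=10),
    Rollout("p2", ["y", "<eos>"], policy_version=12),
]
strict = fresh_rollouts(buffer, current_version=12)              # fully on-policy
relaxed = fresh_rollouts(buffer, current_version=12, max_lag=2)  # tolerates some lag
```

With `max_lag=0` only rollouts from the current student survive. Allowing a small lag trades some on-policy purity for throughput, and is the kind of setting where clipping or importance weighting may become relevant.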

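Another practical issue is the quality of the teacher's feedback. The teacher must be able to give a meaningful next-token distribution even on the student's odd prefixes. If the student wanders into a badly broken state, the teacher's feedback can become hard to interpret.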

Compared with the offline loop

Placing the two loops side by side makes the difference clear.

offline:
  teacher generate -> store target -> student imitate

on-policy:
  student generate -> teacher score -> student update

Offline is closer to a data pipeline; on-policy is closer to a training system.
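The two data flows can be sketched as plain functions. Everything here is a placeholder: `teacher_generate`, `student_generate`, and `teacher_score` stand in for real model calls and are passed in as arguments so the sketch stays self-contained.

```python
def offline_batch(prompts, teacher_generate):
    """teacher generate -> store target: every state comes from the teacher's text."""
    return [{"prompt": p, "target": teacher_generate(p)} for p in prompts]

def on_policy_batch(prompts, student_generate, teacher_score):
    """student generate -> teacher score: every state comes from the student's text."""
    batch = []
    for p in prompts:
        completion = student_generate(p)
        batch.append({
            "prompt": p,
            "completion": completion,
            "teacher_scores": teacher_score(p, completion),
        })
    return batch

prompts = ["p1", "p2"]
offline = offline_batch(prompts, teacher_generate=lambda p: p + " -> teacher answer")
online = on_policy_batch(
    prompts,
    student_generate=lambda p: ["tok1", "tok2"],
    teacher_score=lambda p, completion: [-0.1] * len(completion),
)
```

The offline batch can be built once and stored, which is why it behaves like a data pipeline. The on-policy batch has to be rebuilt whenever the student changes, because `student_generate` is the current policy.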

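This is why cases like DeepSeek-R1-Distill are easy to explain as offline distillation. In contrast, a student's ability to recover from its own mistakes, the reduction of compounding error over long trajectories, and the mixing of multiple teachers per prompt are easier to explain once you move to on-policy distillation.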

Summary

The essence of the on-policy distillation loop fits in one sentence.

The teacher gives dense supervision on trajectories the student produced,
and the student is updated within that trajectory distribution.

This approach can reduce exposure bias, but it substantially raises the cost of teacher calls and of training infrastructure. The next card covers the failure modes that show up most often in this family of distillation methods.

References

Rishabh Agarwal et al., On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes
Yuxian Gu et al., MiniLLM: Knowledge Distillation of Large Language Models
Kushal Arora et al., Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation
Local reference: reference-books/rlhf-book/ch12-Synthetic-Data.md
Local reference: reference-books/rlhf-book/ch06-Reinforcement-Learning.md
Local reference: reference-books/deepseek from scratch/ch07-Reinforcement-learning-From-policy-gradients-to-GRPO.md

Check

In the on-policy distillation loop, who produces the rollout: the teacher or the student?
What intuition does A_t = log pi_T(a_t | s_t) - log pi_theta(a_t | s_t) give?
Why does OPD cost more to implement than offline distillation?