Offline vs On-Policy Distillation

DeepSeek-R1-Distill is a strong example of offline distillation.

The teacher first produces good reasoning traces, and the student learns to follow those stored traces through SFT.

teacher rollout -> stored dataset -> student training

On-policy distillation reverses this direction.

student rollout -> teacher feedback on student states -> student update

Both are teacher-student training, but they differ in who produced the training states.

Offline / Sequence KD

Rollout source: teacher output
Supervised states: teacher prefixes
Typical objective: CE or forward KL
Cost shape: cheap after data generation
Main risk: student may fail on its own prefixes

On-policy Distillation

Rollout source: student output
Supervised states: student prefixes
Typical objective: reverse KL or teacher-scored token feedback
Cost shape: teacher runs during training
Main risk: more expensive and infra-heavy

Prompt: s

Student rollout: a ~ pi_student

Teacher feedback: pi_teacher(. | s, a_<t)

Update student: match or reward teacher-preferred tokens

The dividing line is not whether a teacher exists. It is whose trajectories create the training states: offline distillation trains on teacher states, while on-policy distillation trains on states the current student actually visits.

Offline distillation

Offline distillation first fixes the data produced by the teacher.

prompt s
teacher output u
dataset (s, u)
student learns p_student(u | s)

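This approach is simple to implement and operate.

The teacher only needs to run once.
The dataset can be reviewed and reused.
The teacher does not need to be loaded on a GPU during student training.
Even with an API teacher, generation stays decoupled from the training loop.

The sequence distillation, reasoning distillation, and DeepSeek-R1-Distill covered earlier all sit close to this category.

In terms of the objective, the student is fit to a target given by the teacher distribution or by teacher-generated data. The local RLHF reference describes this as forward KL or SFT-like training.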

teacher trajectory u ~ pi_T
student learns on prefixes (s, u_<t)

In other words, the student learns on prefixes that the teacher visited.
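
The objective above can be sketched as a token-level cross-entropy over teacher prefixes. This is a minimal illustration, not a training recipe: `student_probs` is a hypothetical toy student over a made-up three-token vocabulary, and the dataset is invented for the example.

```python
import math

def student_probs(prefix):
    """Hypothetical toy student: next-token distribution given a prefix.

    Assumption for illustration: uniform over three tokens, ignoring the prefix.
    """
    return {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

def offline_sft_loss(dataset, probs_fn):
    """Average negative log-likelihood of teacher tokens on teacher prefixes."""
    total, count = 0.0, 0
    for prompt, teacher_output in dataset:
        for t, token in enumerate(teacher_output):
            # The supervised state is (s, u_<t): prompt plus earlier TEACHER tokens.
            prefix = tuple(prompt) + tuple(teacher_output[:t])
            total += -math.log(probs_fn(prefix)[token])
            count += 1
    return total / count

# Stored dataset of (prompt s, teacher output u) pairs, generated before training.
dataset = [(["q"], ["a", "b"]), (["q"], ["c"])]
loss = offline_sft_loss(dataset, student_probs)
print(f"offline SFT loss: {loss:.4f}")
```

Every prefix the loss touches is built from teacher tokens. The student's own samples never appear in it.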

Weakness of offline: exposure bias

The problem is that at inference time the student does not move only along teacher prefixes.

During training:

teacher prefix: s, u_1, u_2, ...

At inference:

student prefix: s, a_1, a_2, ...

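If the student makes a small mistake early on, the prefix drifts away from the teacher data distribution. The student then has to choose the next token in a state it has barely trained on. As this error accumulates over a long sequence, the answer becomes unstable or hallucination grows.

This problem is exposure bias.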

training states != inference states

Offline distillation transfers the teacher's good answers cheaply, but it gives the student little direct signal on how to recover after its own mistakes.
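
The accumulation over long sequences can be illustrated with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that the student leaves the teacher-like prefix distribution with a fixed, independent probability `eps` at each token. Real models do not behave this simply, and the value 0.01 is made up.

```python
def prob_still_on_teacher_prefix(eps, length):
    """Probability of no off-distribution step in `length` tokens.

    Assumption: each token independently derails with probability eps.
    """
    return (1.0 - eps) ** length

for length in (10, 100, 1000):
    p = prob_still_on_teacher_prefix(0.01, length)
    print(f"length={length:5d}  P(still on teacher-like prefixes)={p:.6f}")
```

Under this toy assumption, a per-token error rate that looks harmless on short answers leaves almost no probability of staying on trained prefixes over a long reasoning trace.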

On-policy distillation

On-policy distillation takes the states the student actually visits as the training target.

1. The current student answers the prompt.
2. The teacher gives a next-token distribution or a score at that student prefix.
3. The student is updated to move closer to the teacher along its own trajectory.

The key point is that the rollout comes from the student even though the teacher stays fixed.

a ~ pi_student(. | s)
state = (s, a_<t)
teacher feedback = pi_teacher(. | state)

So on-policy distillation pulls "the mistakes the student actually makes" into the training loop. When the student wanders into an off-track prefix, the teacher provides a better next-token distribution or a correction signal at exactly that prefix.
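
The three steps above can be sketched with a tabular toy. Everything here is an assumption made for illustration: the three-token vocabulary, the hypothetical `teacher_probs`, a student stored as one logit vector per visited prefix, and a per-state reverse KL loss whose gradient treats the sampled states as fixed (no gradient through sampling).

```python
import math
import random

VOCAB = ["x", "y", "<eos>"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def teacher_probs(state):
    """Hypothetical fixed teacher: prefers "x" first, then "<eos>"."""
    if len(state) == 1:  # only the prompt so far
        return [0.8, 0.1, 0.1]
    return [0.05, 0.05, 0.9]

def reverse_kl(q, p):
    """KL(q || p) with q the student and p the teacher."""
    return sum(qi * (math.log(qi) - math.log(pi)) for qi, pi in zip(q, p))

def on_policy_step(student_logits, prompt, rng, lr=0.5, max_len=4):
    # 1. Student rollout: sample a ~ pi_student, recording each visited state.
    state, visited = (prompt,), []
    while len(state) - 1 < max_len:
        q = softmax(student_logits.setdefault(state, [0.0] * len(VOCAB)))
        visited.append(state)
        token = rng.choices(VOCAB, weights=q)[0]
        state = state + (token,)
        if token == "<eos>":
            break
    # 2. Teacher feedback on the student's own prefixes, then 3. update.
    total = 0.0
    for s in visited:
        logits = student_logits[s]
        q, p = softmax(logits), teacher_probs(s)
        kl = reverse_kl(q, p)
        total += kl
        for k in range(len(VOCAB)):
            # Gradient of KL(q || p) with respect to logit k of a softmax student.
            grad = q[k] * (math.log(q[k]) - math.log(p[k]) - kl)
            logits[k] -= lr * grad
    return total / len(visited)

rng = random.Random(0)
student_logits = {}
for _ in range(200):
    on_policy_step(student_logits, "prompt", rng)
first_step = softmax(student_logits[("prompt",)])
print([round(v, 3) for v in first_step])
```

Only prefixes the student actually sampled receive an update. A state the student never reaches has no entry in `student_logits`, however good the teacher is there.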

The KL direction changes too

Offline SFT/KD usually draws its samples from the teacher or the target data.

forward KL intuition:
sample from teacher / target
fit student to cover target behavior

On-policy distillation measures the gap from the teacher on samples the student produced.

reverse KL intuition:
sample from student
penalize student where teacher assigns low probability

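This is also the point that MiniLLM-style papers emphasize. When the gap from the teacher is measured on tokens and prefixes the student generated itself, the update happens near the student's current distribution.

This difference is not just a matter of taste in formulas. Which distribution the samples come from changes where the learning signal lands.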
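
A small numeric comparison can make the two directions concrete. The distributions below are invented for illustration: a teacher with two good tokens and one bad token, a "spread" student that covers everything, and a "peaked" student that commits to one good token.

```python
import math

def kl(a, b):
    """KL(a || b) for two discrete distributions given as lists."""
    return sum(ai * (math.log(ai) - math.log(bi)) for ai, bi in zip(a, b) if ai > 0)

teacher = [0.49, 0.49, 0.02]   # two acceptable tokens, one poor token
spread = [1 / 3, 1 / 3, 1 / 3]  # covers all tokens, including the poor one
peaked = [0.96, 0.02, 0.02]    # commits to a single acceptable token

for name, student in (("spread", spread), ("peaked", peaked)):
    forward = kl(teacher, student)  # expectation under teacher samples
    reverse = kl(student, teacher)  # expectation under student samples
    print(f"{name}: forward KL={forward:.3f}  reverse KL={reverse:.3f}")
```

On these made-up numbers, forward KL ranks the spread student better because it covers both teacher modes, while reverse KL ranks the peaked student better because it puts little mass where the teacher assigns low probability.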

offline:
  "Follow the good teacher answer."

on-policy:
  "Correct yourself by the teacher's standard at the places you actually went."

Cost and selection criteria

On-policy is not always the better choice.

When offline distillation fits:

Teacher calls are expensive.
You want a static dataset that can be reviewed.
Imitating teacher traces is enough for the student.
You want to train several students on the same dataset.

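When on-policy distillation is attractive:

The student's failure modes differ a lot from the teacher data.
Small errors compound heavily over long sequences.
The teacher can give feedback on the student's actual attempts.
You can afford RL infrastructure or an online generation loop.

The price is that on-policy distillation has to keep calling the teacher inside the training loop, which is expensive. It may also require comparing tokenizers and logprobs between teacher and student, and it needs infrastructure that alternates rollout generation with updates.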

Summary

The difference between offline and on-policy comes down to one sentence.

Offline distillation learns by following the path the teacher made,
while on-policy distillation is corrected by the teacher on the path the student actually walked.

The next card breaks this on-policy distillation loop down more concretely into student rollout, teacher scoring, KL/reward signal, and update step.

References

Rishabh Agarwal et al., On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes
Yuxian Gu et al., MiniLLM: Knowledge Distillation of Large Language Models
Kushal Arora et al., Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation
Local reference: reference-books/rlhf-book/ch12-Synthetic-Data.md
Local reference: reference-books/rlhf-book/ch15-Regularization.md
Local reference: reference-books/deepseek from scratch/ch08-Knowledge-distillation-Making-powerful-models-practical.md

Check

How does the rollout source differ between offline distillation and on-policy distillation?
Why does exposure bias become a bigger problem in long LLM sequences?
Why is on-policy distillation more expensive?