Distillation Evaluation and Recipe

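The final question in model distillation is not “Did the loss go down?”

It is this: under the conditions in which the student will be deployed, does it actually perform, in a verifiable form, the capability the teacher was meant to pass on?

So evaluation should be a set of gates, not a single benchmark score.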

Data gate:
  Can the teacher data be trusted?
  license and provenance, deduplication, decontamination, verified targets

Fit gate:
  Did the student learn the teacher targets?
  held-out validation loss, answer-token loss, format stability, overfit check

Task gate:
  Did real task success improve?
  exact match, verifier accuracy, pass@k, judge audit

Robustness gate:
  Does the gain survive distribution shift?
  held-out domains, perturbation tests, long rollouts, failure review

Ship gate:
  Is this student worth deploying?
  latency, serving cost, teacher gap, safety regressions

Decision rule:
  Ship only when imitation fit, task success, robustness, provenance, and cost all improve against the base student.

A distilled model should pass through multiple gates. Validation loss shows whether the student copied the target distribution; benchmark and deployment gates show whether the copied behavior is useful.

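1. Narrow the target capability first

Distillation is not the job of copying the entire teacher.

A small student has limited capacity, so you have to decide up front which capability to transfer.

bad goal: build a 7B model that does everything as well as the teacher
good goal: build a 7B model that follows the teacher's reasoning traces on math problems
good goal: produce teacher-level answers for internal codebase QA at lower cost

Once the target capability is fixed, the teacher, prompt source, verifier, and benchmark follow from it.

The DeepSeek-R1-Distill case is also best read not as “replicating every capability of the 671B teacher” but as SFT of small Qwen- and Llama-family students on verifiable reasoning traces.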

2. Check first whether the teacher data can be trusted

In hard distillation, the teacher output is the training target.

So if the teacher data is wrong, the student learns those errors as if they were correct.

The data gate should check at least the following.

Are the teacher model and its version recorded?
Does the license allow teacher output to be used for student training?
Are prompt sources and benchmark sources kept separate?
Were deduplication and decontamination performed?
Did the data pass a correctness verifier or quality filter?

The local AI Engineering reference summarizes the key risks of synthetic data as quality, imitation limits, model collapse, and data lineage. In particular, when lineage is unclear, it is hard to judge benchmark contamination or whether commercial use is permitted.
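
The deduplication and decontamination checks in the list above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the function names, the word-level 8-gram window, and the "any shared n-gram" rule are all assumptions made for the example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of the normalized text (empty if fewer than n words)."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(prompts):
    """Drop exact duplicates after normalization, keeping first occurrences."""
    seen, kept = set(), []
    for p in prompts:
        key = hashlib.sha256(normalize(p).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept


def decontaminate(train_prompts, benchmark_prompts, n: int = 8):
    """Split training prompts into (clean, flagged).

    A prompt is flagged when it shares at least one n-gram with any
    benchmark prompt. Prompts shorter than n words are never flagged,
    which is a known blind spot of this simple rule.
    """
    bench = set()
    for b in benchmark_prompts:
        bench |= ngrams(b, n)
    clean, flagged = [], []
    for p in train_prompts:
        (flagged if ngrams(p, n) & bench else clean).append(p)
    return clean, flagged
```

Flagged prompts are better logged than silently dropped, since the flagged set is itself evidence about how the prompt source and the benchmark source overlap.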

3. The training signal only speaks to imitation fit

The first signal to watch during distillation training is validation loss.

The distillation chapter of reasoning-model-from-scratch also treats held-out validation loss as the main signal of training progress. A falling validation loss means the student is fitting the teacher-generated target sequences better.

But this is not proof of capability.

validation loss goes down
  = imitates the teacher targets better
  != solves new problems better

In reasoning distillation in particular, where the loss is applied also matters. If prompt tokens are trained on as well, loss can be wasted on “the ability to copy the question”. Usually the loss is placed on the parts the student has to generate: response tokens, answer tokens, and reasoning trace tokens.
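
A minimal sketch of response-only loss masking follows. It assumes the common convention of marking ignored positions with -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`); the helper names are hypothetical, and the usual one-position shift between inputs and next-token targets is left out to keep the idea visible.

```python
IGNORE_INDEX = -100  # assumed convention: positions with this label get no loss


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt part of the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def masked_nll(token_logprobs, labels):
    """Mean negative log-likelihood over unmasked positions only.

    token_logprobs[i] is the log-probability the student assigned to the
    target token at position i; masked positions are skipped entirely.
    """
    kept = [-lp for lp, y in zip(token_logprobs, labels) if y != IGNORE_INDEX]
    if not kept:
        raise ValueError("no response tokens to train on")
    return sum(kept) / len(kept)
```

With this masking, a student that assigns low probability to the prompt tokens is not penalized for it; only the tokens it is expected to generate contribute to the loss.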

4. Measure task success with task-specific evaluators

Capability evaluation differs by domain.

math: exact match, boxed answer parser, verifier accuracy
code: unit test, pass@k, hidden test
open-ended chat: pairwise preference, LLM-as-judge, human audit
knowledge QA: exact match, citation check, retrieval-grounded audit
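
For the code row, pass@k is usually computed with the unbiased estimator from the HumanEval paper (Chen et al., listed in the references): with n samples per problem of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). A small sketch, with the function name and argument checks chosen for this example:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    Returns the probability that a random size-k subset of the n samples
    contains at least one passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of this value over problems. Comparing base student, distilled student, and teacher only makes sense when all three use the same n, k, and sampling settings.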

The evaluation chapter of the RLHF Book distinguishes exact match, log-likelihood scoring, pass@k, and generation-based reasoning evaluation. For reasoning models, the model is asked to generate a chain of thought and the final answer then has to be extracted robustly, so the evaluation format itself easily becomes the bottleneck.
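
As an illustration of robust final-answer extraction for math, the sketch below pulls the contents of the last `\boxed{...}` from a generation, handling nested braces, and then applies exact match. The normalization rules are deliberately simple assumptions; real evaluators typically also handle equivalent forms such as fractions versus decimals.

```python
def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces: count as a format failure


def normalize_answer(ans: str) -> str:
    """Strip whitespace, surrounding dollar signs, and a trailing period."""
    return ans.strip().strip("$").rstrip(".").replace(" ", "")


def exact_match(generation: str, reference: str) -> bool:
    """True when the extracted answer equals the reference after normalization."""
    pred = extract_boxed(generation)
    return pred is not None and normalize_answer(pred) == normalize_answer(reference)
```

Tracking how often `extract_boxed` returns None separately from accuracy helps tell a format failure apart from a wrong answer.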

MT-Bench-style LLM-as-judge evaluation is useful for open-ended answers, but it inherits the judge model's bias and prompt sensitivity. A ship decision should therefore never rest on the judge score alone.

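5. Read benchmark scores together with contamination

Distillation relies heavily on synthetic data, which makes it vulnerable to benchmark contamination.

The teacher may already have seen the benchmark.
Benchmark prompts may have leaked into the teacher output dataset.
The student may have memorized the benchmark format rather than learned to reason.

So when reading benchmark results, do not look at the raw score alone; look at decontamination and perturbation tests alongside it.

Is the original MATH score high while accuracy collapses under perturbations that only change the numbers?
Is the overlap between HumanEval prompts and training prompts high?
Does the model also improve on a benchmark such as LiveCodeBench, which reduces contamination by tracking problem release dates?

A public benchmark is a tool that shows direction, not complete evidence of deployment quality.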
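
The perturbation question above can be turned into a small report. The sketch assumes per-problem correctness is already available as booleans for each original problem and for one perturbed variant of it; `max_drop` is a hypothetical tolerance chosen for the example, not a recommended value.

```python
def accuracy(results):
    """Fraction of True values; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


def perturbation_report(original, perturbed, max_drop=0.10):
    """Compare accuracy on original problems and their perturbed variants.

    original[i] and perturbed[i] must refer to the same underlying problem.
    """
    if len(original) != len(perturbed):
        raise ValueError("expected one perturbed variant per original problem")
    acc_orig, acc_pert = accuracy(original), accuracy(perturbed)
    # Problems solved in original form but failed after perturbation are
    # the ones most worth reviewing by hand.
    broken = [i for i, (o, p) in enumerate(zip(original, perturbed)) if o and not p]
    return {
        "original": acc_orig,
        "perturbed": acc_pert,
        "drop": acc_orig - acc_pert,
        "suspicious": acc_orig - acc_pert > max_drop,
        "broken_indices": broken,
    }
```

A large drop does not prove contamination by itself, but it is a reason to rerun the overlap check between training prompts and the benchmark before trusting the original score.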

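6. Compare against the base student, the teacher, and the deployment budget

Distillation results should be compared against three reference points.

base student: did distillation actually improve anything?
teacher: how much of the gap remains?
deployment target: did latency and cost drop enough?

It is natural for the student to score below the teacher. What matters is not how many points it trails the teacher by, but whether task success rises enough over the base student and serving cost falls enough.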
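
One way to put the three reference points on the same page is to report the fraction of the base-to-teacher gap that distillation closed, next to the cost ratio. Both helpers below are illustrative assumptions, not metrics taken from the cited sources.

```python
def gap_recovered(base: float, distilled: float, teacher: float) -> float:
    """Fraction of the base-to-teacher score gap closed by distillation.

    0.0 means no improvement over the base student, 1.0 means the student
    matches the teacher, and negative values mean a regression.
    """
    if teacher <= base:
        raise ValueError("teacher must score above the base student")
    return (distilled - base) / (teacher - base)


def cost_ratio(student_cost: float, teacher_cost: float) -> float:
    """Student serving cost relative to the teacher, in the same unit."""
    if teacher_cost <= 0:
        raise ValueError("teacher cost must be positive")
    return student_cost / teacher_cost
```

Reading the two numbers together keeps the focus on improvement over the base student and on savings relative to the teacher, rather than on the raw teacher gap alone.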

For example, if a small distilled model improves a lot on math but stays weak on coding, that is less a failure than a sign of where domain-specific distillation ends. If the product goal is a general coding assistant, however, that boundary becomes a ship blocker.

Operating recipe

In practice, you can start in the following order.

1. Write down the target capability and the deployment constraints.
2. Fix the teacher model, license, version, and generation settings.
3. Separate the training prompt source from the evaluation source.
4. Generate several teacher outputs per prompt and filter them with a verifier, a judge, and rule filters.
5. Fix the format of the reasoning trace and the final answer.
6. Run a small baseline first, with the loss applied to response tokens.
7. Look at held-out validation loss and generation samples together.
8. Compare the base student, distilled student, and teacher on the task benchmark.
9. Check perturbation, OOD, long rollouts, and safety regressions.
10. Decide on one of: ship, more data, bigger student, on-policy loop, or stop.
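
Step 4 of the recipe can be sketched as a verifier-based filter over several teacher samples for one prompt. The verifier is passed in as a function because it depends on the task; the function name, the record layout, and the `max_keep` cap are assumptions made for this example.

```python
from typing import Callable, Dict, List


def filter_teacher_samples(
    prompt: str,
    reference: str,
    samples: List[str],
    verify: Callable[[str, str], bool],
    max_keep: int = 2,
) -> List[Dict[str, str]]:
    """Keep up to max_keep distinct teacher samples that pass the verifier."""
    kept, seen = [], set()
    for s in samples:
        # Skip verbatim repeats and anything the verifier rejects.
        if s in seen or not verify(s, reference):
            continue
        seen.add(s)
        kept.append({"prompt": prompt, "response": s})
        if len(kept) == max_keep:
            break
    return kept
```

Prompts for which no sample passes are worth recording too: they mark where the teacher itself is unreliable, which is information the data gate needs.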

Decision criteria

A distillation project should end in one of the following.

ship:
  Task success improves over the base student, and cost and latency fall within target.

more data:
  Validation loss improves, but failures cluster around a specific data gap.

bigger student:
  Data quality is good, but broad tasks break down because of a capacity gap.

on-policy loop:
  Offline loss is good, but exposure bias keeps recurring in student rollouts.

stop:
  Teacher output license, data lineage, contamination, or safety risk cannot be resolved.
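
The five outcomes above can be written as a small decision function. The field names, the order in which conditions are checked, and the "undecided" fallback are all assumptions of this sketch; the text itself does not prescribe a priority between outcomes.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    blocking_risk: bool                 # unresolved license, lineage, contamination, or safety issue
    task_improved: bool                 # task success beats the base student
    within_budget: bool                 # cost and latency are inside the target
    val_loss_improved: bool             # held-out validation loss went down
    failures_cluster_on_data_gap: bool  # failure review points at one data gap
    data_quality_ok: bool
    broad_tasks_collapse: bool          # capacity gap shows up on broad tasks
    rollout_exposure_bias: bool         # errors recur on the student's own rollouts


def decide(s: Signals) -> str:
    """Map evaluation signals to one of the outcomes (assumed priority order)."""
    if s.blocking_risk:
        return "stop"
    if s.task_improved and s.within_budget:
        return "ship"
    if s.val_loss_improved and s.rollout_exposure_bias:
        return "on-policy loop"
    if s.val_loss_improved and s.failures_cluster_on_data_gap:
        return "more data"
    if s.data_quality_ok and s.broad_tasks_collapse:
        return "bigger student"
    return "undecided"  # fallback added for the sketch: gather more evidence
```

Checking the blocking risks first reflects the idea that no benchmark gain can compensate for an unresolved license or safety problem.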

Seen this way, model distillation is not just a compression trick.

Distillation is an engineering workflow that moves teacher behavior into the data, loss, evaluation, and deployment budget that the student can handle.

References

DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Lianmin Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Dan Hendrycks et al., Measuring Mathematical Problem Solving With the MATH Dataset
Mark Chen et al., Evaluating Large Language Models Trained on Code
Local reference: reference-books/rlhf-book/ch16-Evaluation.md
Local reference: reference-books/reasoning-model-from-scratch/ch08-Distilling-reasoning-models-for-efficient-reasoning.md
Local reference: reference-books/deepseek from scratch/ch08-Knowledge-distillation-Making-powerful-models-practical.md
Local reference: reference-books/AI engineering/ch08.md

Review questions

Why might the task benchmark fail to improve even though validation loss went down?
Why does distillation evaluation need to compare against both the base student and the teacher?
What signals indicate that an on-policy loop is needed even when offline distillation results look good?