Model Distillation: The Big Picture

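The most common misconception on first encountering model distillation is the phrase “compressing a large model into a small one.”

That description is not entirely wrong, but a more accurate way to put it is this.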

Model distillation transfers useful behavior.

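Distillation does not shrink the teacher's weights as they are. Instead, the student is trained to learn what output the teacher produces for a given input, which candidates it considers plausible, and which solution patterns it repeats.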

Teacher (large, capable, expensive): frontier model, ensemble, reasoning model
  -> Teaching signal (behavior made trainable): soft targets, answers, reasoning traces
  -> Student (smaller, cheaper, deployable): small LLM, edge model, specialist model

Not copied: weights, exact internals, full capacity
Transferred: useful behavior, uncertainty, solution patterns

Distillation is best read as behavior transfer. The student does not need to become the teacher; it needs to preserve enough useful behavior for the deployment target.

Why it is needed

Large models do many things well, but calling the largest model for every product request is expensive and slow.

frontier teacher
  -> strong but expensive

student model
  -> weaker but cheaper, faster, easier to deploy

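Distillation starts from the economic question that sits between these two.

Can a small model inherit, as training data,
the capabilities a large model has already discovered,
without rediscovering them from scratch?

In classic knowledge distillation, the student matches the probability distribution produced by the teacher. For example, even when the correct answer is cat, if the teacher rates dog and fox somewhat higher than an unrelated class such as car, the student learns not only “cat is the answer” but also the structure that “dog is a closer wrong answer than car.”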
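The cat/dog/car example can be made concrete with a minimal sketch. The logits below are invented for illustration, and the temperature of 2.0 is an arbitrary assumption; the usual T^2 scaling and the mix with a hard-label loss are left out to keep the comparison small.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

classes = ["cat", "dog", "fox", "car"]

# Hypothetical teacher logits for a cat image: dog and fox are near misses.
teacher_logits = [5.0, 2.5, 2.0, -2.0]
# Hypothetical student logits early in training.
student_logits = [1.0, 0.2, 0.1, 0.3]

# Hard label: only "cat is correct" reaches the student.
hard_loss = -math.log(softmax(student_logits)[classes.index("cat")])

# Soft target: the teacher's full distribution, including how wrong each
# wrong answer is, reaches the student.
soft_target = softmax(teacher_logits, temperature=2.0)
student_probs = softmax(student_logits, temperature=2.0)
soft_loss = kl_divergence(soft_target, student_probs)

for name, prob in zip(classes, soft_target):
    print(f"{name}: {prob:.3f}")
print(f"hard-label loss: {hard_loss:.3f}, soft-target KL: {soft_loss:.3f}")
```

The hard-label loss only looks at the student's probability for cat, while the KL term is sensitive to every class, which is where the "dog is closer than car" information enters.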

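In LLMs this structure becomes broader. Answers, code, reasoning traces, tool calls, preference labels, and critiques produced by the teacher can all serve as teaching signals.

What gets compressed is not the weights

The target of distillation is usually not the teacher's internal weights.

Not: teacher weights -> smaller copied weights
Yes: teacher behavior -> cheaper behavior approximation

This is why distillation works even when the teacher and student have different architectures. The DeepSeek-R1-Distill family, too, is not R1 with parameters cut away; it is closer to Qwen/Llama base models trained on reasoning data generated by R1.

This perspective matters. Distillation is different from techniques like quantization, which reduce the numeric representation of the same model. It is also different from techniques like pruning, which cut away layers or width. Distillation is a training procedure that decides “what to preserve” through data and loss.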
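The point that only behavior crosses the teacher-student boundary can be sketched as a data-collection step. Everything here is hypothetical: `build_distillation_set`, `toy_teacher`, and `has_final_answer` are illustrative names, and the toy teacher is a stand-in for a call to a large model. This is not a description of how any specific distilled model was built.

```python
import json
from typing import Callable, Dict, List

def build_distillation_set(
    prompts: List[str],
    teacher_generate: Callable[[str], str],
    keep: Callable[[str, str], bool],
) -> List[Dict[str, str]]:
    """Collect teacher outputs as plain prompt/completion records.

    Only text crosses the boundary, so the student is free to use a
    different architecture and tokenizer from the teacher.
    """
    records = []
    for prompt in prompts:
        completion = teacher_generate(prompt)
        # The filter decides what behavior is worth preserving.
        if keep(prompt, completion):
            records.append({"prompt": prompt, "completion": completion})
    return records

# Stand-in teacher for illustration only; a real one would query a large model.
def toy_teacher(prompt: str) -> str:
    a, b = (int(x) for x in prompt.split("+"))
    return f"Add the two numbers: {a} + {b} = {a + b}. Answer: {a + b}"

# Hypothetical quality filter: keep traces that end with an explicit answer.
def has_final_answer(prompt: str, completion: str) -> bool:
    return "Answer:" in completion

dataset = build_distillation_set(["2+3", "10+7"], toy_teacher, has_final_answer)

# One JSON object per line is a common format for fine-tuning data.
jsonl = "\n".join(json.dumps(record) for record in dataset)
print(jsonl)
```

The student would then be fine-tuned on these records with an ordinary supervised loss; no teacher weight is read at any point.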

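Three questions

When looking at model distillation, it helps to always split it into three questions.

1. What does the teacher know?
2. What signals can the student receive?
3. What must be preserved in the deployment environment?

The first question is the choice of teacher. A math teacher, a coding teacher, and a general assistant teacher each provide different behavior.

The second question is the choice of teaching signal.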

soft logits
teacher-generated answers
reasoning traces
preference labels
tool-use trajectories
critique / correction

The third question is the choice of evaluation. It is not enough to check whether the student talks like the teacher. Actual task accuracy, latency, cost, safety, and domain robustness have to be examined together.
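A small sketch shows why agreement with the teacher and task accuracy have to be reported separately. The `EvalExample` record, the `evaluate` helper, and all of the data are made up for illustration; cost, safety, and domain robustness would need their own measurements and are not covered here.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EvalExample:
    gold: str                  # reference answer for the task
    teacher: str               # teacher's answer
    student: str               # student's answer
    student_latency_ms: float  # measured student latency

def evaluate(examples: List[EvalExample]) -> Dict[str, float]:
    """Report teacher agreement, task accuracy, and latency side by side."""
    n = len(examples)
    return {
        "teacher_agreement": sum(e.student == e.teacher for e in examples) / n,
        "task_accuracy": sum(e.student == e.gold for e in examples) / n,
        "teacher_accuracy": sum(e.teacher == e.gold for e in examples) / n,
        "mean_latency_ms": sum(e.student_latency_ms for e in examples) / n,
    }

# Hypothetical results: the student copies the teacher everywhere,
# including the two cases where the teacher is wrong.
examples = [
    EvalExample(gold="4", teacher="4", student="4", student_latency_ms=12.0),
    EvalExample(gold="9", teacher="8", student="8", student_latency_ms=11.0),
    EvalExample(gold="7", teacher="7", student="7", student_latency_ms=13.0),
    EvalExample(gold="5", teacher="6", student="6", student_latency_ms=12.0),
]

report = evaluate(examples)
for metric, value in report.items():
    print(f"{metric}: {value}")
```

In this toy data the student agrees with the teacher on every example yet answers only half of them correctly, so a report based on agreement alone would overstate its quality.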

Map of this path

This path covers model distillation in the following order.

big picture
-> teacher-student and dark knowledge
-> hard distillation vs soft distillation
-> temperature and KL loss
-> LLM sequence distillation
-> reasoning distillation
-> DeepSeek-R1 distillation case
-> offline vs on-policy distillation
-> failure modes and evaluation recipe

There is one thing to remember from this first card.

Training discovers capability.
Distillation distributes capability.

References

Geoffrey Hinton, Oriol Vinyals, Jeff Dean, Distilling the Knowledge in a Neural Network
Victor Sanh et al., DistilBERT, a distilled version of BERT
DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Local reference: reference-books/deepseek from scratch/ch08-Knowledge-distillation-Making-powerful-models-practical.md
Local reference: reference-books/rlhf-book/ch12-Synthetic-Data.md

Check

In model distillation, is it usually the weights or the behavior that gets compressed?
How does training only on teacher-generated answers differ from matching logits?
Why is it not enough for the student to merely talk like the teacher?