Reasoning Distillation

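LLM sequence distillation takes a completed answer produced by the teacher and uses it as the student's target sequence.

Reasoning distillation is one step more specific.

Instead of having the student copy only the answer, it also puts the solution procedure that leads to the answer into the target sequence.

So the core signal in reasoning distillation is not the final answer alone but the reasoning trace plus the final answer.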

Problem: a math or code task with a checkable answer (role: context; loss: masked)
Reasoning trace: <think>decompose, compute, verify...</think> (role: procedure target; loss: trained)
Final answer: Therefore, the answer is ... (role: outcome target; loss: trained)

Answer-only SFT: final result; can miss the method
Reasoning distill: method + result; can copy teacher habits
RL with verifier: rewarded outcome; can discover new traces

Reasoning distillation uses the teacher's intermediate solution trace as part of the target sequence. The student is not only learning the answer; it is learning a response protocol for getting to the answer.

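Why a reasoning trace is needed

In math, code, and logic problems, the final answer alone often does not tell the student enough about what it should learn.

prompt:  a complex math problem
target:  42

Data like this can give the student the signal “say 42 for this prompt.” But it conveys almost nothing about how the problem was decomposed, which equations were set up, or how intermediate results were verified.

Adding a reasoning trace changes the target.

prompt:  a complex math problem
target:  break down the problem -> set up equations -> compute -> check the result -> state the final answer

The student does not just match the answer tokens; it learns the problem-solving protocol the teacher demonstrated, through next-token prediction.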

The shape of CoT distillation

In real datasets, it is common to store the reasoning trace and the final answer separately and then merge them into a single training target.

problem:          user prompt
message_thinking: teacher reasoning trace
message_content:  teacher final answer

training target:
<think>{message_thinking}</think>

{message_content}

A tag such as <think> is not strictly required for distillation itself. But clearly separating the reasoning span from the final answer span makes training, evaluation, UI handling, and verifier parsing easier.
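As a minimal sketch of the merge step and of why the delimiter helps parsing, the snippet below joins the two stored fields into one target string and then splits it back. The function names `build_target` and `split_target` and the sample record are hypothetical; only the field names and the `<think>` layout come from the text above.

```python
import re
from typing import Optional, Tuple

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def build_target(message_thinking: str, message_content: str) -> str:
    """Join the stored trace and answer into one training target string."""
    return f"{THINK_OPEN}{message_thinking.strip()}{THINK_CLOSE}\n\n{message_content.strip()}"

# DOTALL lets the trace and the answer span multiple lines.
_THINK_RE = re.compile(r"^<think>(.*?)</think>\s*(.*)$", re.DOTALL)

def split_target(target: str) -> Optional[Tuple[str, str]]:
    """Recover (trace, answer); return None if the expected format is missing."""
    match = _THINK_RE.match(target.strip())
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()

# Made-up record for illustration only.
record = {
    "problem": "What is 6 * 7?",
    "message_thinking": "6 * 7 = 42. Check: 42 / 7 = 6.",
    "message_content": "Therefore, the answer is 42.",
}
target = build_target(record["message_thinking"], record["message_content"])
print(target)
print(split_target(target))
```

Because the split is a single regular expression, an evaluator or UI can hide the trace or grade only the answer section without guessing where the reasoning ends.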

The important point is that the loss is applied to the teacher response, not to the prompt.

[problem tokens][reasoning trace tokens][final answer tokens]
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                span the student must imitate
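One common way to express this masking is to copy the token ids into a label sequence and overwrite the prompt positions with an ignore value. The sketch below uses plain Python lists and made-up token ids; `build_labels` is a hypothetical helper, and it assumes the training framework applies the next-token shift and skips label `-100`, which is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # assumed ignore value; matches PyTorch's default ignore_index

def build_labels(prompt_ids: List[int], response_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) with the prompt positions excluded from the loss."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Made-up token ids, for illustration only.
prompt_ids = [11, 12, 13]        # [problem tokens]
trace_ids = [21, 22, 23, 24]     # [reasoning trace tokens]
answer_ids = [31, 32]            # [final answer tokens]

input_ids, labels = build_labels(prompt_ids, trace_ids + answer_ids)
print(input_ids)
print(labels)  # [-100, -100, -100, 21, 22, 23, 24, 31, 32]
```

The trace and the answer are both kept in the labels, so the student is trained on the whole teacher response while the problem only serves as context.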

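What gets transferred

Reasoning distillation does not transfer knowledge only as a “list of correct answers.” More precisely, it transfers the following behavior patterns.

Breaking the problem into subproblems
Setting up the necessary variables and equations
Keeping track of intermediate computations
Following the answer format
Verifying the result at the end

This is also why DeepSeek-R1-style distillation is interesting. The large teacher produces strong reasoning traces through RL and rejection sampling, and the small student imitates those traces with SFT. The small model does not have to pay the teacher's exploration cost all over again.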

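This is different from just getting the answer right

Reasoning distillation and RLVR (reinforcement learning with verifiable rewards) answer different questions.

RLVR:
  the student generates multiple solutions itself,
  and a verifier grades the final answer.

reasoning distillation:
  good solutions the teacher has already produced are stored,
  and the student imitates those solution sequences.

RL can discover or reinforce new behavior. But it requires sampling, rewards, and policy updates, so it is costly and can be unstable.

Distillation spreads behavior the teacher already has at lower cost. But it does not make the student discover, on its own, a new reasoning style that the teacher could not produce.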
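The difference in where the training data comes from can be sketched as follows. This is a toy illustration, not a training loop: `rlvr_step_data`, `distillation_data`, the stand-in student, and the stand-in verifier are all hypothetical, and no policy update or SFT step is shown.

```python
from typing import Callable, List, Tuple

def rlvr_step_data(
    sample_from_student: Callable[[str], str],
    verifier: Callable[[str], float],
    prompt: str,
    num_samples: int,
) -> List[Tuple[str, float]]:
    """RLVR-style: the student generates, and the verifier scores each rollout."""
    rollouts = [sample_from_student(prompt) for _ in range(num_samples)]
    return [(response, verifier(response)) for response in rollouts]

def distillation_data(prompt: str, teacher_responses: List[str]) -> List[Tuple[str, str]]:
    """Distillation-style: stored teacher responses become (prompt, target) pairs."""
    return [(prompt, response) for response in teacher_responses]

# Toy stand-ins: a "student" that replays canned attempts and a verifier
# that only checks whether the response ends with the expected answer.
attempts = iter(["<think>6 + 7 = 13</think>\n\n13", "<think>6 * 7 = 42</think>\n\n42"])
student = lambda prompt: next(attempts)
verifier = lambda response: 1.0 if response.strip().endswith("42") else 0.0

print(rlvr_step_data(student, verifier, "What is 6 * 7?", num_samples=2))
print(distillation_data("What is 6 * 7?", ["<think>6 * 7 = 42</think>\n\n42"]))
```

In the first case the learning signal is a scalar reward attached to the student's own text; in the second it is the teacher's text itself, used as a supervised target.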

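Risks

Adding reasoning traces does not always produce reasoning ability.

If the teacher trace is wrong, the student learns the wrong procedure too.
If the trace is verbose, the student can become needlessly verbose as well.
If formats are mixed, the student's answer protocol becomes unstable.
If the student lacks capacity, it may only mimic sentences that look like a solution.

That is why data quality control is central to reasoning distillation. Final answer correctness, trace readability, format consistency, and domain balance all need to be checked together.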
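A minimal sketch of such a filter, covering only part of the checks named above, might look like this. It assumes numeric answers, the `<think>...</think>` layout, and a character limit as a crude proxy for verbosity; `keep_example`, `extract_final_number`, and the sample candidates are hypothetical. Trace readability and domain balance would need separate checks that are not shown.

```python
import re
from typing import Optional

_FORMAT_RE = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)
_NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")

def extract_final_number(answer_text: str) -> Optional[str]:
    """Treat the last number in the answer section as the final answer."""
    numbers = _NUMBER_RE.findall(answer_text)
    return numbers[-1] if numbers else None

def keep_example(target: str, reference: str, max_trace_chars: int = 4000) -> bool:
    """Keep a teacher response only if it passes format, length, and answer checks."""
    match = _FORMAT_RE.match(target.strip())
    if match is None:                     # format consistency
        return False
    trace, answer = match.group(1), match.group(2)
    if len(trace) > max_trace_chars:      # crude verbosity limit (assumed threshold)
        return False
    return extract_final_number(answer) == reference  # final answer correctness

# Made-up candidates for illustration only.
candidates = [
    {"target": "<think>6 * 7 = 42</think>\n\nTherefore, the answer is 42.", "reference": "42"},
    {"target": "<think>6 * 7 = 48</think>\n\nTherefore, the answer is 48.", "reference": "42"},
    {"target": "The answer is 42.", "reference": "42"},
]
kept = [c for c in candidates if keep_example(c["target"], c["reference"])]
print(len(kept))  # 1
```

Only the first candidate survives: the second has a wrong final answer, and the third has the right answer but no reasoning section in the expected format.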

In short, the unit of reasoning distillation is not “the correct answer” but “a response sequence that contains a verifiable solution.”

References

Jason Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Eric Zelikman et al., STaR: Bootstrapping Reasoning With Reasoning
DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Local reference: reference-books/reasoning-model-from-scratch/ch08-Distilling-reasoning-models-for-efficient-reasoning.md
Local reference: reference-books/deepseek from scratch/ch07-Reinforcement-learning-From-policy-gradients-to-GRPO.md
Local reference: reference-books/deepseek from scratch/ch08-Knowledge-distillation-Making-powerful-models-practical.md
Local reference: reference-books/AI engineering/ch08.md

Review

How does reasoning distillation differ from training on the final answer only?
Why is a delimiter such as <think>...</think> useful in practice?
Why can reasoning distillation not fully replace RLVR?