Training a PyTorch Transformer

This path takes the PyTorch Transformer you wrote yourself and actually trains it.

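The goal is not to stop once the model code exists: recast the model as a next-token prediction problem and build a single-GPU training loop in which the loss goes down. After that, add gradient accumulation, mixed precision, and an accounting of activation and optimizer memory to prepare for the move to multi-GPU training.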

You need the baseline established in this path to compare DDP, FSDP, and ZeRO later.

model forward
  -> next-token loss
  -> backward
  -> optimizer step
  -> memory / throughput baseline
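
The flow above can be sketched as a single function. This is a minimal sketch rather than the path's actual code: the embedding-plus-linear stand-in model, the random token batch, and the names `train_step`, `tokens_per_second`, and `peak_bytes` are assumptions made for illustration, and the script falls back to CPU when no GPU is available.

```python
import time

import torch
import torch.nn as nn
import torch.nn.functional as F


def train_step(model, optimizer, inputs, targets):
    """Run forward -> next-token loss -> backward -> optimizer step once."""
    model.train()
    logits = model(inputs)  # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch * seq_len, vocab_size)
        targets.reshape(-1),
    )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()  # .item() also waits for the GPU work to finish


torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
vocab_size, d_model, num_steps = 32, 16, 50

# Stand-in model (assumption): predicts the next token from the current token only.
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

# One fixed random batch of token ids; inputs and targets are offset by one position.
tokens = torch.randint(0, vocab_size, (8, 17), device=device)
inputs, targets = tokens[:, :-1], tokens[:, 1:]

start = time.perf_counter()
losses = [train_step(model, optimizer, inputs, targets) for _ in range(num_steps)]
elapsed = time.perf_counter() - start

# Baseline numbers to compare against later: throughput and, on GPU, peak memory.
tokens_per_second = num_steps * inputs.numel() / elapsed
peak_bytes = torch.cuda.max_memory_allocated() if device == "cuda" else None
print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}, {tokens_per_second:.0f} tokens/s")
```

Repeating the step on one fixed batch is only a sanity check that the loss can go down; a real baseline would iterate over a dataset and exclude warm-up steps from the timing.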

Assembling a PyTorch Decoder-only Transformer: combine an embedding, several decoder blocks, a final norm, and an LM head into a small language model.
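
One way the assembly might look is sketched below. The class names `DecoderBlock` and `TinyLM`, the pre-norm layout, the learned position embedding, and all default sizes are assumptions for illustration, not the lesson's actual design. The sketch also assumes PyTorch 2.0 or later for `F.scaled_dot_product_attention`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderBlock(nn.Module):
    """Pre-norm block: causal self-attention, then an MLP, each with a residual."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.norm1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        b, t, c = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        # Split channels into heads: (b, t, c) -> (b, n_heads, t, head_dim)
        q, k, v = (
            z.view(b, t, self.n_heads, c // self.n_heads).transpose(1, 2)
            for z in (q, k, v)
        )
        # is_causal=True stops each position from attending to later positions.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(attn.transpose(1, 2).reshape(b, t, c))
        return x + self.mlp(self.norm2(x))


class TinyLM(nn.Module):
    """Embedding -> decoder blocks -> final norm -> LM head."""

    def __init__(self, vocab_size, d_model=64, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(
            [DecoderBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.final_norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        # idx: (batch, seq_len) token ids, with seq_len <= max_len
        positions = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.final_norm(x))  # (batch, seq_len, vocab_size)
```

Calling `TinyLM(vocab_size=50)(idx)` on a `(batch, seq_len)` tensor of token ids returns one row of vocabulary logits per position, which is the shape the next-token loss expects.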

Next-token Loss and Building Batches: shift a token sequence by one position to form the input and the target, then compute the language modeling loss.
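
The one-position shift can be illustrated as follows. This is a minimal sketch: the helper names `make_inputs_and_targets` and `next_token_loss` and the toy token ids are hypothetical, chosen only to show the indexing.

```python
import torch
import torch.nn.functional as F


def make_inputs_and_targets(tokens):
    """Split (batch, seq_len + 1) token ids into inputs and one-step-shifted targets."""
    inputs = tokens[:, :-1]   # positions 0 .. T-1
    targets = tokens[:, 1:]   # positions 1 .. T, i.e. the "next token" at each step
    return inputs, targets


def next_token_loss(logits, targets):
    """Cross-entropy over all positions; logits is (batch, seq_len, vocab_size)."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )


tokens = torch.tensor([[5, 6, 7, 8]])
inputs, targets = make_inputs_and_targets(tokens)
# inputs  is [[5, 6, 7]]
# targets is [[6, 7, 8]]: the model sees 5 and should predict 6, and so on.
```

With all-zero logits the prediction is uniform over the vocabulary, so the loss equals the natural log of the vocabulary size. An untrained model should start near that value.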

Single-GPU Training Step Review: before looking at training parallelism, review how a single optimizer step proceeds on one GPU.

PyTorch Single-GPU Training Loop: build a single-GPU training loop that includes the forward pass, loss, backward pass, optimizer update, and mixed precision.

Gradient Accumulation: accumulate the gradients of several small micro-batches so that one update behaves like an update on a large batch.
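
A sketch of the idea, assuming equal-sized micro-batches and a mean-reduced loss: dividing each micro-batch loss by the number of micro-batches makes the summed gradient match the gradient of one large batch. The helper `accumulate_gradients` and the stand-in model are hypothetical names used only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def accumulate_gradients(model, micro_batches):
    """Backward every micro-batch without stepping; returns the mean loss."""
    n = len(micro_batches)
    mean_loss = 0.0
    for inputs, targets in micro_batches:
        logits = model(inputs)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )
        # backward() adds into .grad, so scaling by 1/n turns the sum into a mean.
        (loss / n).backward()
        mean_loss += loss.item() / n
    return mean_loss


torch.manual_seed(0)
vocab_size = 32
# Stand-in model (assumption), small enough to run anywhere.
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

tokens = torch.randint(0, vocab_size, (8, 9))
inputs, targets = tokens[:, :-1], tokens[:, 1:]
# Split a batch of 8 sequences into 4 micro-batches of 2 sequences each.
micro_batches = list(zip(inputs.chunk(4), targets.chunk(4)))

optimizer.zero_grad(set_to_none=True)  # clear once per accumulated step
loss = accumulate_gradients(model, micro_batches)
optimizer.step()                       # one update for all micro-batches
```

If micro-batches hold different numbers of tokens, the plain `1/n` scaling no longer reproduces the large-batch gradient exactly, and the loss would have to be weighted by token count instead.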

Mixed Precision Training: understand why training uses a different precision for each computation and state instead of lowering every value to low precision.
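
The split between low-precision compute and full-precision state can be sketched with `torch.autocast`. The function name `autocast_train_step` and the stand-in model are assumptions, and the demo runs on CPU with bfloat16 so it works without a GPU; on a GPU one would pass `device_type="cuda"`. With float16 instead of bfloat16, loss scaling is usually added as well, which this sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def autocast_train_step(model, optimizer, inputs, targets,
                        device_type="cpu", dtype=torch.bfloat16):
    """One step with the forward matmuls in low precision and the loss in float32."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device_type, dtype=dtype):
        logits = model(inputs)  # linear layers run in `dtype` inside this block
    # Cast back up before the loss so the softmax and log are computed in float32.
    loss = F.cross_entropy(
        logits.float().reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    loss.backward()
    optimizer.step()  # parameters and optimizer state stay in float32
    return logits.dtype, loss.item()


torch.manual_seed(0)
vocab_size = 32
# Stand-in model (assumption); its parameters are created in float32.
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

tokens = torch.randint(0, vocab_size, (4, 9))
logits_dtype, loss_value = autocast_train_step(
    model, optimizer, tokens[:, :-1], tokens[:, 1:]
)
param_dtypes = {p.dtype for p in model.parameters()}
grad_dtypes = {p.grad.dtype for p in model.parameters()}
```

Here the activations leaving the linear layer are bfloat16, while the parameters, their gradients, and the AdamW state remain float32, which is the "different precision per value" idea in miniature.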

Training Memory Overview: break training memory down into parameters, activations, gradients, and optimizer states.
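
A rough back-of-the-envelope version of that breakdown is sketched below. The byte counts are assumptions, not measurements: float32 parameters and gradients (4 bytes each) and an Adam-style optimizer that keeps two float32 values per parameter (8 bytes). Activation memory depends on batch size, sequence length, and architecture, so the hypothetical helper `training_memory_bytes` takes it as an input instead of estimating it.

```python
def training_memory_bytes(n_params, activation_bytes=0,
                          param_bytes=4, grad_bytes=4, optim_state_bytes=8):
    """Split an estimated training footprint into its four components, in bytes."""
    breakdown = {
        "parameters": n_params * param_bytes,
        "gradients": n_params * grad_bytes,
        "optimizer_states": n_params * optim_state_bytes,
        "activations": activation_bytes,  # measured or estimated separately
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Example: a 1M-parameter model, ignoring activations.
estimate = training_memory_bytes(1_000_000)
for name, size in estimate.items():
    print(f"{name:>16}: {size / 1e6:.1f} MB")
```

Under these assumptions the parameter-proportional part costs 16 bytes per parameter, and only a quarter of that is the weights themselves. Measured numbers will differ because of allocator overhead, temporary buffers, and activations.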