Pipeline Parallelism from Picotron

This card does not re-explain PP as a concept. The goal is to be able to read a PP training step in the Picotron code.

Pipeline Parallelism: split layers into stages, then stream microbatches through the pipeline.

16 layers -> 4 pipeline stages

Stage 0 (GPU 0): L0 L1 L2 L3
Stage 1 (GPU 1): L4 L5 L6 L7
Stage 2 (GPU 2): L8 L9 L10 L11
Stage 3 (GPU 3): L12 L13 L14 L15

Forward: the activation moves to the next stage
S0 -> act -> S1 -> act -> S2 -> act -> S3

Backward: the activation gradient moves back to the previous stage
S3 -> grad act -> S2 -> grad act -> S1 -> grad act -> S0

Naive PP: with one microbatch, only one stage works at a time

Stage 0: F B
Stage 1: F B
Stage 2: F B
Stage 3: F B

Legend: F = forward, B = backward; every other slot is a bubble

AFAB (all forward, all backward): microbatches reduce the bubble, but forward activations pile up

Stage 0: F0 F1 F2 F3 B3 B2 B1 B0
Stage 1: F0 F1 F2 F3 B3 B2 B1 B0
Stage 2: F0 F1 F2 F3 B3 B2 B1
Stage 3: F0 F1 F2 F3 B3 B2

Bubble ratio: (p - 1) / m, with p pipeline stages and m microbatches

Activation pile-up: m forward activations wait for backward
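To make the bubble ratio concrete, the following minimal sketch evaluates (p - 1) / m for a few microbatch counts. It assumes every forward and backward slot costs the same on every stage; the function name `bubble_ratio` is illustrative and does not come from Picotron.

```python
def bubble_ratio(num_stages: int, num_microbatches: int) -> float:
    """Idle time divided by useful time, (p - 1) / m, for a non-interleaved pipeline."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be positive")
    return (num_stages - 1) / num_microbatches


# With p fixed, the bubble shrinks only as m grows.
for m in (1, 4, 6, 32):
    print(f"p=4, m={m}: bubble ratio = {bubble_ratio(4, m):.3f}")
```

Raising m is what shrinks the bubble, and it is also what makes AFAB hold more activations at once.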

1F1B (one forward, one backward): warm up, then alternate forward and backward to release activations earlier

Phases: warmup, steady 1F1B, cooldown

Stage 0: F0 F1 F2 F3 B0 F4 B1 F5 B2 B3 B4 B5
Stage 1: F0 F1 F2 B0 F3 B1 F4 B2 F5 B3 B4
Stage 2: F0 F1 B0 F2 B1 F3 B2 F4 B3 F5
Stage 3: F0 B0 F1 B1 F2 B2 F3 B3 F4

Earlier release: after Bi, activation i can be freed

Still has bubbles: warmup and cooldown remain

Per microbatch: activations / PP, because each stage owns fewer layers

Before the first backward: PP x (activations / PP) ~= activations, because several microbatches are stored at once

Next idea: 1F1B, start backward earlier to free activations sooner

Interleaved 1F1B (virtual stages): split each physical stage into smaller layer chunks to reduce the remaining bubbles

GPU 0: V0 = L0-L1, V4 = L8-L9
GPU 1: V1 = L2-L3, V5 = L10-L11
GPU 2: V2 = L4-L5, V6 = L12-L13
GPU 3: V3 = L6-L7, V7 = L14-L15

GPU 0: F0@V0 F1@V0 F0@V4 B0@V4 F2@V0 B0@V0 F1@V4 B1@V4
GPU 1: F0@V1 F1@V1 F0@V5 B0@V5 F2@V1 B0@V1 F1@V5
GPU 2: F0@V2 F1@V2 F0@V6 B0@V6 F2@V2 B0@V2
GPU 3: F0@V3 F1@V3 F0@V7 B0@V7 F2@V3

Benefit: smaller chunks -> tighter schedule

Cost: more stage boundaries and sends

PP reduces parameter memory by splitting layers, but scheduling determines bubble size and activation memory.

Picotron's PipelineParallel shows the core of PP at a glance.

self.layer_distribution = self.distribute_layers(config.num_hidden_layers)
self.embedding = model.embedding if pp_is_first_stage else nn.Identity()
self.decoder_layers = nn.ModuleDict(...)
self.final_norm = model.final_norm if pp_is_last_stage else nn.Identity()
self.final_proj = model.final_proj if pp_is_last_stage else nn.Identity()

The important point is that the stages do not each replicate the same model. Each rank holds as real modules only the layers its stage owns, and exchanges activation tensors with the neighboring stages.

forward:  previous stage -> hidden_states -> local layers -> next stage
backward: next stage -> grad(hidden_states) -> local backward -> previous stage

Picotron's PP splits into three parts

First, layer ownership.

layers_per_gpu = [
  # leftover layers go one each to the lowest ranks
  num_layers // pp_world_size + (1 if i < num_layers % pp_world_size else 0)
  for i in range(pp_world_size)
]
start_layer = sum(layers_per_gpu[:pp_rank])

This code decides which decoder layers each PP rank owns. Only the first stage holds the embedding, and only the last stage holds the final norm and projection.
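The ownership rule can be run on its own, without any distributed setup. The sketch below is a standalone version that takes `pp_world_size` and `pp_rank` as plain arguments (an assumption made for illustration) and hands any leftover layers one each to the lowest ranks.

```python
def distribute_layers(num_layers: int, pp_world_size: int, pp_rank: int) -> list[int]:
    """Return the decoder layer indices owned by one pipeline rank."""
    remainder = num_layers % pp_world_size
    layers_per_gpu = [
        num_layers // pp_world_size + (1 if i < remainder else 0)
        for i in range(pp_world_size)
    ]
    # Layers are contiguous: a rank starts where the previous ranks end.
    start_layer = sum(layers_per_gpu[:pp_rank])
    return list(range(start_layer, start_layer + layers_per_gpu[pp_rank]))


for rank in range(4):
    print(f"rank {rank}: {distribute_layers(16, 4, rank)}")
```

With 16 layers on 4 ranks every stage gets 4 layers. With 10 layers on 4 ranks the split becomes 3, 3, 2, 2.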

Second, P2P communication between stages.

recv_forward  # receive the activation from the previous stage
send_forward  # send the activation to the next stage
recv_backward # receive the activation grad from the next stage
send_backward # send the activation grad to the previous stage

Picotron runs send/recv through dist.batch_isend_irecv. Unlike TP, where every rank enters the same collective, here only adjacent stages hand tensors to each other.
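The four operations differ only in direction and in which neighbor they talk to. The sketch below captures that mapping in plain Python, without torch.distributed. The helper `p2p_peer` is hypothetical, and the peer it returns is an index within the pipeline group, not a global rank.

```python
from typing import Optional, Tuple


def p2p_peer(operation: str, pp_rank: int, pp_world_size: int) -> Optional[Tuple[str, int]]:
    """Return (direction, peer_pp_rank), or None when the stage sits at a pipeline edge."""
    is_first = pp_rank == 0
    is_last = pp_rank == pp_world_size - 1
    if operation == "recv_forward":   # activation arrives from the previous stage
        return None if is_first else ("recv", pp_rank - 1)
    if operation == "send_forward":   # activation leaves for the next stage
        return None if is_last else ("send", pp_rank + 1)
    if operation == "recv_backward":  # activation grad arrives from the next stage
        return None if is_last else ("recv", pp_rank + 1)
    if operation == "send_backward":  # activation grad leaves for the previous stage
        return None if is_first else ("send", pp_rank - 1)
    raise ValueError(f"unknown operation: {operation}")


for op in ("recv_forward", "send_forward", "recv_backward", "send_backward"):
    print(op, [p2p_peer(op, rank, 4) for rank in range(4)])
```

The None entries are the edges of the pipeline: the first stage has nothing to receive in forward, and the last stage has nowhere to send.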

Third, the microbatch schedule.

AFAB: all forward -> all backward
1F1B: warmup -> forward/backward alternating -> cooldown

AFAB is the easiest to read

Picotron's train_step_pipeline_afab first runs the forward pass for every microbatch.

for _ in range(grad_acc_steps):
    input_tensor = recv_forward()
    output_tensor = model.forward(...)
    send_forward(output_tensor)
    input_tensors.append(input_tensor)
    output_tensors.append(output_tensor)

It then pops the stored tensors in FIFO order and runs backward.

for ith_microbatch in range(grad_acc_steps):
    output_tensor_grad = recv_backward()
    input_tensor, output_tensor = input_tensors.pop(0), output_tensors.pop(0)
    input_tensor_grad = model.backward(input_tensor, output_tensor, output_tensor_grad)
    send_backward(input_tensor_grad)

AFAB is good for debugging. The forward region and the backward region are separate, so the tensor flow is simple. But every forward activation has to be held for a long time, so activation memory grows.

1F1B is a schedule for reducing memory pressure

Picotron's 1F1B splits into three phases.

num_warmup_microbatches = min(pp_world_size - pp_rank - 1, grad_acc_steps)
num_microbatches_remaining = grad_acc_steps - num_warmup_microbatches

The earlier the stage, the more forwards it must send before the pipeline fills. The last stage can start backward almost immediately.
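The two formulas above are enough to tabulate the phases for each rank. The sketch below does that with plain integers. The function name and the returned keys are illustrative, and it assumes the cooldown runs exactly as many backwards as the warmup ran forwards.

```python
def one_f_one_b_phases(pp_rank: int, pp_world_size: int, grad_acc_steps: int) -> dict:
    """Count the microbatches handled in each 1F1B phase on one pipeline rank."""
    num_warmup_microbatches = min(pp_world_size - pp_rank - 1, grad_acc_steps)
    num_microbatches_remaining = grad_acc_steps - num_warmup_microbatches
    return {
        "warmup_forwards": num_warmup_microbatches,
        "steady_pairs": num_microbatches_remaining,     # one forward plus one backward each
        "cooldown_backwards": num_warmup_microbatches,  # drains the warmup forwards
    }


for rank in range(4):
    print(f"rank {rank}: {one_f_one_b_phases(rank, 4, 6)}")
```

With 4 stages and 6 microbatches, rank 0 warms up with 3 forwards while rank 3 has no warmup at all. The `min` matters when there are fewer microbatches than stages: the steady phase can then be empty.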

In the steady state, forward and backward communication are fused into one call.

send_fwd_recv_bwd
send_bwd_recv_fwd

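This structure is the core of 1F1B. While sending the next microbatch's activation forward, the stage receives the previous microbatch's activation gradient from the next stage. As a result, the activation of a microbatch whose backward has finished can be freed sooner.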

AFAB: F0 F1 F2 F3 ... B0 B1 B2 B3
1F1B: after warmup, F and B alternate
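The memory difference can be read straight off an operation sequence: an activation is live from its F until its B. The sketch below counts the peak number of live activations on one stage. The helper name is hypothetical, the sequences assume 6 microbatches on 4 stages, and the last-stage sequence is written out in full as an assumption about how the alternating pattern continues.

```python
def peak_live_activations(ops: list[str]) -> int:
    """Peak number of microbatch activations held between their F and their B."""
    live: set[int] = set()
    peak = 0
    for op in ops:
        kind, microbatch = op[0], int(op[1:])
        if kind == "F":
            live.add(microbatch)      # forward output must be kept for backward
            peak = max(peak, len(live))
        elif kind == "B":
            live.remove(microbatch)   # backward done, activation can be freed
        else:
            raise ValueError(f"unknown op: {op}")
    return peak


afab = "F0 F1 F2 F3 F4 F5 B0 B1 B2 B3 B4 B5".split()
one_f_one_b_first_stage = "F0 F1 F2 F3 B0 F4 B1 F5 B2 B3 B4 B5".split()
one_f_one_b_last_stage = "F0 B0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5".split()

print("AFAB:", peak_live_activations(afab))
print("1F1B, first stage:", peak_live_activations(one_f_one_b_first_stage))
print("1F1B, last stage:", peak_live_activations(one_f_one_b_last_stage))
```

Under these assumptions AFAB peaks at the number of microbatches, while the 1F1B peak is bounded by the pipeline depth and shrinks toward the last stage.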

Nanotron separates the schedule into engine and state

Nanotron separates the same concepts in a more framework-like way.

PipelineEngine
  - AllForwardAllBackwardPipelineEngine
  - OneForwardOneBackwardPipelineEngine

PipelineTrainBatchState
  - microbatches_activations_to_send
  - microbatches_activations_to_recv
  - microbatches_grads_to_send
  - microbatches_grads_to_recv
  - microbatches_activations_requiring_backward

In Picotron, schedule and communication appear together inside train_step_pipeline_1f1b. In Nanotron, the model block registers send/recv requests in the state, and the engine consumes that queue.

Picotron: the schedule function calls recv/send/backward directly
Nanotron: PipelineBlock -> PipelineBatchState queue -> PipelineEngine
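The register-then-consume pattern can be shown with a toy that has nothing to do with Nanotron's real classes. Every name below (`ToyBatchState`, `ToyBlock`, `ToyEngine`) is hypothetical; only the idea that a block enqueues requests and an engine drains them is taken from the description above.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyBatchState:
    """Hypothetical stand-in for a batch state: two FIFO queues of pending sends."""
    activations_to_send: deque = field(default_factory=deque)
    grads_to_send: deque = field(default_factory=deque)


class ToyBlock:
    """A model block that registers communication instead of performing it."""

    def __init__(self, state: ToyBatchState):
        self.state = state

    def forward(self, microbatch: int) -> None:
        self.state.activations_to_send.append(("act", microbatch))

    def backward(self, microbatch: int) -> None:
        self.state.grads_to_send.append(("grad", microbatch))


class ToyEngine:
    """Owns the schedule and drains whatever the block registered."""

    def __init__(self, state: ToyBatchState):
        self.state = state
        self.sent: list = []  # stands in for the actual P2P sends

    def flush(self) -> None:
        for queue in (self.state.activations_to_send, self.state.grads_to_send):
            while queue:
                self.sent.append(queue.popleft())

    def run_afab(self, block: ToyBlock, num_microbatches: int) -> list:
        for i in range(num_microbatches):
            block.forward(i)
            self.flush()
        for i in range(num_microbatches):
            block.backward(i)
            self.flush()
        return self.sent


state = ToyBatchState()
print(ToyEngine(state).run_afab(ToyBlock(state), 2))
```

In this toy, switching schedules means writing another engine method, and the block does not change. That is the kind of separation the engine/state split is meant to allow.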

This is the difference between a small implementation and a real framework. Picotron is good for learning; to see extensibility, look at Nanotron.

Megatron deals with production schedule problems

Megatron's schedules.py handles the same 1F1B under many more conditions.

pp_size == 1: no pipelining
pp_size > 1 and vp_size is None: non-interleaved 1F1B
pp_size > 1 and vp_size exists: interleaved 1F1B

Here vp_size is the virtual pipeline size. Instead of holding one large stage, each GPU holds several smaller model chunks, which reduces the bubble.
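The chunk layout behind virtual stages can be sketched as a round-robin assignment. The function below is illustrative and is not Megatron's code. It assumes the layer count divides evenly into pp_size x vp_size chunks, and it reproduces the 16-layer, 4-GPU layout shown earlier in this card.

```python
def virtual_stage_layers(num_layers: int, pp_size: int, vp_size: int) -> dict:
    """Map each GPU to its (virtual_stage, layers) chunks, assigned round-robin."""
    num_virtual_stages = pp_size * vp_size
    if num_layers % num_virtual_stages != 0:
        raise ValueError("num_layers must be divisible by pp_size * vp_size")
    chunk = num_layers // num_virtual_stages
    placement = {gpu: [] for gpu in range(pp_size)}
    for v in range(num_virtual_stages):
        layers = list(range(v * chunk, (v + 1) * chunk))
        placement[v % pp_size].append((v, layers))  # V0..V3 on GPUs 0..3, then wrap
    return placement


for gpu, chunks in virtual_stage_layers(16, 4, 2).items():
    print(f"GPU {gpu}: {chunks}")
```

Each microbatch now crosses every GPU vp_size times. That is the source of the tighter schedule, and also of the extra sends.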

Megatron also actually implements a memory optimization that Picotron left as a TODO.

deallocate_output_tensor(out)

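After the activation has been sent to the next stage, what matters is the grad_fn of the autograd graph rather than the tensor data itself. Megatron replaces the output tensor's data with a small scalar to reduce memory pressure, and handles backward through a custom path.

Lab

pipeline_schedule_sim.py compares schedules only, without running a model.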

python3 labs/large-scale-training-parallelism/pipeline_schedule_sim.py --pp 4 --microbatches 6

In the output, . marks a bubble.

AFAB:
S0: F0 F1 F2 ...
S1:  . F0 F1 ...

1F1B-style:
S0: F0 F1 F2 F3 B0 F4 B1 ...
S1:  . F0 F1 B0 F2 B1 ...

The point of this lab is not to replicate the exact Megatron schedule. It is first to see a pipeline timeline that satisfies the dependencies, and then to understand how AFAB and 1F1B produce different activation lifetimes.
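For readers without the lab script at hand, a dependency-respecting timeline can be produced in a few dozen lines. The sketch below is not pipeline_schedule_sim.py; it is an independent minimal simulator. It assumes F and B each take one time slot, and that a stage may run an operation only after the neighboring stage finished the matching operation in an earlier slot. All function names are illustrative.

```python
def afab_order(rank: int, pp: int, m: int) -> list:
    return [("F", i) for i in range(m)] + [("B", i) for i in range(m)]


def one_f_one_b_order(rank: int, pp: int, m: int) -> list:
    warmup = min(pp - rank - 1, m)
    ops = [("F", i) for i in range(warmup)]
    for i in range(m - warmup):                            # steady state: F then B
        ops += [("F", warmup + i), ("B", i)]
    ops += [("B", m - warmup + i) for i in range(warmup)]  # cooldown
    return ops


def simulate(order_fn, pp: int, m: int) -> list:
    """Return one row per stage; '.' marks a slot where the stage is idle."""
    queues = [order_fn(rank, pp, m) for rank in range(pp)]
    cursor = [0] * pp
    done = set()                       # (kind, microbatch, stage) finished in earlier slots
    rows = [[] for _ in range(pp)]
    while any(cursor[s] < len(queues[s]) for s in range(pp)):
        finished_now = []
        for s in range(pp):
            if cursor[s] == len(queues[s]):
                rows[s].append(".")    # stage has no work left
                continue
            kind, mb = queues[s][cursor[s]]
            upstream = s - 1 if kind == "F" else s + 1
            ready = not (0 <= upstream < pp) or (kind, mb, upstream) in done
            if ready:
                rows[s].append(f"{kind}{mb}")
                finished_now.append((kind, mb, s))
                cursor[s] += 1
            else:
                rows[s].append(".")
        if not finished_now:
            raise RuntimeError("schedule deadlocked")
        done.update(finished_now)      # results become visible in the next slot
    return rows


for name, fn in (("AFAB", afab_order), ("1F1B-style", one_f_one_b_order)):
    print(f"{name}:")
    for s, row in enumerate(simulate(fn, 4, 6)):
        print(f"S{s}: " + " ".join(f"{cell:>2}" for cell in row))
```

With equal-cost slots both schedules finish in the same number of slots. What changes is how long each activation stays alive between its F and its B.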

Reading order

When reading PP as code, this order is the most reliable.

1. In Picotron, check layer ownership in PipelineParallel.distribute_layers.
2. Trace the four pipeline_communicate operations: recv_forward, send_forward, recv_backward, send_backward.
3. Use train_step_pipeline_afab to understand tensor storage and the FIFO backward.
4. Read warmup, steady state, and cooldown in train_step_pipeline_1f1b.
5. In Nanotron, see how the same structure is split into PipelineEngine and PipelineTrainBatchState.
6. In Megatron, check why virtual pipeline, schedule table, and output deallocation are added.

The key point is that PP should not be viewed only as "a technique for splitting layers". In real training, stage placement, microbatch schedule, activation lifetime, and P2P communication order move as one bundle.