PyTorch Distributed Profiler

Getting distributed training to run does not mean it got faster.

There are four questions to check.

How much did step time drop compared to single-GPU?
How much GPU memory does each rank use?
How large is the communication time?
Do communication and backward compute actually overlap?
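
The first question can be put into numbers with a small helper. The following is a minimal sketch, not a fixed recipe: `time_steps` and `scaling` are hypothetical names, and throughput (samples per second) is compared instead of raw step time on the assumption that the global batch may differ between the single-GPU and multi-GPU runs. On a GPU, `sync_fn` would typically be `torch.cuda.synchronize`, so that queued kernels are counted in the measured time.

```python
import statistics
import time


def time_steps(step_fn, sync_fn=None, warmup=3, steps=10):
    """Return the median wall-clock seconds per call of step_fn.

    sync_fn runs before each clock read. On a GPU this would be
    torch.cuda.synchronize (an assumption of this sketch), because
    kernel launches are asynchronous.
    """
    sync = sync_fn or (lambda: None)
    for _ in range(warmup):          # warmup steps are not timed
        step_fn()
    durations = []
    for _ in range(steps):
        sync()
        start = time.perf_counter()
        step_fn()
        sync()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def scaling(single_step_s, single_batch, multi_step_s, multi_global_batch, n_gpus):
    """Compare throughput (samples/sec) of a multi-GPU run to a single-GPU run."""
    single_tput = single_batch / single_step_s
    multi_tput = multi_global_batch / multi_step_s
    speedup = multi_tput / single_tput
    return {"speedup": speedup, "efficiency": speedup / n_gpus}
```

An efficiency well below 1.0 suggests that communication or some other per-step overhead is eating the gain, which is what the remaining questions try to locate.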

In DDP, the gradient all-reduce for each bucket runs during backward, as soon as that bucket's gradients are ready. In the profiler, look at how the NCCL communication is placed between the compute kernels.

backward layer 4
NCCL all-reduce bucket A
backward layer 3
NCCL all-reduce bucket B
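
The overlap in a timeline like the one above can be quantified from exported trace events. This is a rough sketch under several assumptions: events are Chrome-trace style dicts with `ph`, `cat`, `name`, `ts`, and `dur` fields (microseconds), GPU kernels carry the category `kernel`, and any kernel whose name contains `nccl` is communication. The function name `comm_overlap` is hypothetical, and every non-NCCL kernel is counted as compute, not only backward kernels.

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def comm_overlap(events):
    """Split GPU kernel events into NCCL vs. other kernels and measure overlap."""
    comm, compute = [], []
    for ev in events:
        # Keep only complete ("X") events in the kernel category.
        if ev.get("ph") != "X" or str(ev.get("cat", "")).lower() != "kernel":
            continue
        span = (ev["ts"], ev["ts"] + ev["dur"])
        # Heuristic (assumption): NCCL kernels have "nccl" in their name.
        is_comm = "nccl" in str(ev.get("name", "")).lower()
        (comm if is_comm else compute).append(span)

    comm, compute = merge(comm), merge(compute)
    comm_total = sum(end - start for start, end in comm)

    # Two-pointer sweep over both disjoint lists to sum the intersection.
    hidden = 0
    i = j = 0
    while i < len(comm) and j < len(compute):
        lo = max(comm[i][0], compute[j][0])
        hi = min(comm[i][1], compute[j][1])
        if lo < hi:
            hidden += hi - lo
        if comm[i][1] < compute[j][1]:
            i += 1
        else:
            j += 1

    return {
        "comm_us": comm_total,
        "hidden_us": hidden,
        "exposed_us": comm_total - hidden,
        "overlap_fraction": hidden / comm_total if comm_total else 0.0,
    }
```

For a trace written by `torch.profiler`'s `export_chrome_trace`, the events would come from the `traceEvents` list of the JSON file for one rank. The exposed time is the part of communication that is not hidden behind compute, so it is the number to watch when tuning.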

FSDP adds a parameter all-gather and reduces gradients with reduce-scatter instead of all-reduce. Memory per GPU goes down, but in exchange the communication pattern changes.
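
A back-of-the-envelope calculation shows the size of the memory saving. The sketch below assumes full sharding and mixed-precision Adam at 16 bytes per parameter (2 for the half-precision parameter, 2 for its gradient, 12 for the fp32 master copy and two optimizer moments). It ignores activations, buffers, allocator fragmentation, and the parameters that are temporarily all-gathered for the unit being computed, so measured peak memory will be higher. `model_state_gib` is a hypothetical helper name.

```python
def model_state_gib(n_params, n_gpus, sharded, bytes_per_param=16):
    """Estimate per-GPU memory (GiB) for parameters, gradients, optimizer state.

    sharded=False models DDP, where every rank holds a full replica.
    sharded=True models fully sharded FSDP, where each rank holds 1/n_gpus.
    bytes_per_param=16 is an assumption (mixed-precision Adam).
    """
    total_bytes = n_params * bytes_per_param
    per_gpu_bytes = total_bytes / n_gpus if sharded else total_bytes
    return per_gpu_bytes / 2**30


# Illustrative numbers only, not measurements.
ddp = model_state_gib(1_000_000_000, n_gpus=8, sharded=False)
fsdp = model_state_gib(1_000_000_000, n_gpus=8, sharded=True)
print(f"DDP:  {ddp:.2f} GiB per GPU")
print(f"FSDP: {fsdp:.2f} GiB per GPU")
```

The estimate only bounds the model state. Comparing it against the measured peak memory per rank shows how much of the footprint comes from activations and transient all-gather buffers.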

Record the final report in the following format.

configuration
GPU count
global batch / microbatch
step time
tokens/sec
peak memory per GPU
communication bottleneck
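
One way to keep these fields consistent across runs is a small record type. This is only a sketch: the `RunReport` class, its field names, and the `seq_len` field are assumptions added for illustration, and tokens/sec is derived as global batch times sequence length divided by step time. The values in the usage line are placeholders, not measured results.

```python
from dataclasses import asdict, dataclass


@dataclass
class RunReport:
    configuration: str              # e.g. "DDP" or "FSDP"
    gpu_count: int
    global_batch: int               # samples per optimizer step, all ranks
    microbatch: int                 # samples per rank per forward/backward
    seq_len: int                    # tokens per sample (assumed field)
    step_time_s: float
    peak_memory_gib_per_gpu: float
    communication_bottleneck: str

    @property
    def tokens_per_sec(self):
        """Throughput derived from batch size, sequence length, and step time."""
        return self.global_batch * self.seq_len / self.step_time_s

    @property
    def grad_accum_steps(self):
        """Accumulation steps implied by global_batch = microbatch * gpus * steps."""
        return self.global_batch // (self.microbatch * self.gpu_count)

    def as_row(self):
        """Flatten stored and derived fields into one dict for logging."""
        row = asdict(self)
        row["tokens_per_sec"] = self.tokens_per_sec
        row["grad_accum_steps"] = self.grad_accum_steps
        return row


# Placeholder values for illustration only.
report = RunReport("DDP", 8, 64, 4, 2048, 0.5, 30.0, "exposed all-reduce")
print(report.as_row())
```

Deriving tokens/sec from the stored fields, instead of typing it in by hand, keeps the report internally consistent when step time is re-measured.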

Check