PyTorch RMSNorm Benchmark Baseline

Last updated: June 30, 2026

pytorch, rmsnorm, benchmark, profiling

RMSNorm makes a good first exercise on the profiling path.

The reason is that it is small and easy to control.

input:  [B, T, D]
output: [B, T, D]
work:   row-wise reduction + elementwise scale

Start by taking the PyTorch reference as the baseline.

import torch

def rmsnorm_ref(x, weight, eps=1e-6):
    # Inverse RMS over the last (hidden) dimension; keepdim lets it broadcast over D
    inv_rms = torch.rsqrt(torch.mean(x * x, dim=-1, keepdim=True) + eps)
    return weight * x * inv_rms
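Before timing anything, it helps to confirm the reference behaves as expected. The following is a minimal sketch, assuming a small arbitrary shape and a unit weight (both chosen only for illustration): with weight set to ones, each output row should have an RMS very close to 1.

```python
import torch

def rmsnorm_ref(x, weight, eps=1e-6):
    inv_rms = torch.rsqrt(torch.mean(x * x, dim=-1, keepdim=True) + eps)
    return weight * x * inv_rms

torch.manual_seed(0)
B, T, D = 2, 4, 64        # arbitrary small shape, for illustration only
x = torch.randn(B, T, D)
weight = torch.ones(D)    # unit weight, so the output is just x / rms

y = rmsnorm_ref(x, weight)

# With unit weight, every row of the output should have RMS close to 1.
row_rms = y.pow(2).mean(dim=-1).sqrt()
print(y.shape, row_rms.min().item(), row_rms.max().item())
```

The same check can be reused later against a custom kernel by swapping in its output for y.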

There are three things to measure.

1. standalone RMSNorm latency
2. RMSNorm's share of time inside a Transformer block
3. variation with dtype and hidden size
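Items 1 and 3 can be sketched with torch.utils.benchmark, which takes care of warmup and, on GPU, of synchronization. The helper name time_rmsnorm, the shapes, and the dtype list here are assumptions made for illustration, not a prescribed sweep; real runs would use the model's actual B, T, and D.

```python
import torch
from torch.utils import benchmark

def rmsnorm_ref(x, weight, eps=1e-6):
    inv_rms = torch.rsqrt(torch.mean(x * x, dim=-1, keepdim=True) + eps)
    return weight * x * inv_rms

def time_rmsnorm(B, T, D, dtype, device, min_run_time=0.05):
    """Median seconds per call of rmsnorm_ref for one (shape, dtype) setting."""
    x = torch.randn(B, T, D, device=device, dtype=dtype)
    weight = torch.ones(D, device=device, dtype=dtype)
    timer = benchmark.Timer(
        stmt="rmsnorm_ref(x, weight)",
        globals={"rmsnorm_ref": rmsnorm_ref, "x": x, "weight": weight},
    )
    return timer.blocked_autorange(min_run_time=min_run_time).median

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical sweep: two dtypes and two hidden sizes, kept small so it runs anywhere.
results = {}
for dtype in (torch.float32, torch.bfloat16):
    for D in (256, 1024):
        results[(dtype, D)] = time_rmsnorm(B=2, T=64, D=D, dtype=dtype, device=device)

for (dtype, D), sec in results.items():
    print(f"{str(dtype):16s} D={D:5d}  {sec * 1e6:9.1f} us")
```

Keeping the results keyed by (dtype, D) makes it easy to line up the same table later for a custom kernel.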

Only with this baseline in place can you separate kernel-level speedup from end-to-end speedup once a custom CUDA RMSNorm is wired in.
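The gap between the two kinds of speedup can be estimated with an Amdahl-style calculation. This is a rough sketch: the function name end_to_end_speedup and the numbers (a 5% RMSNorm share, a 2x faster kernel) are hypothetical, and it assumes the rest of the step is unaffected by the kernel swap.

```python
def end_to_end_speedup(rmsnorm_fraction, kernel_speedup):
    """Amdahl-style estimate of whole-step speedup.

    rmsnorm_fraction: share of baseline step time spent in RMSNorm (0..1)
    kernel_speedup:   baseline RMSNorm latency / custom RMSNorm latency
    """
    remaining = (1.0 - rmsnorm_fraction) + rmsnorm_fraction / kernel_speedup
    return 1.0 / remaining

# Hypothetical numbers, not measurements.
f, s = 0.05, 2.0
est = end_to_end_speedup(f, s)
print(f"kernel speedup {s:.1f}x at {f:.0%} share -> end-to-end {est:.3f}x")
```

Under these assumed numbers, a 2x kernel improves the whole step by only about 1.03x, which is why the RMSNorm share from measurement 2 matters as much as the standalone latency.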

Check

Why should an RMSNorm benchmark look at both standalone latency and end-to-end step time?
How does the reduction cost of RMSNorm change as the hidden size grows?
What role does the PyTorch reference play when comparing against a custom CUDA kernel?