Attention Workload: Prefill vs Decode

Last modified: June 30, 2026

pytorch, attention, inference, profiling

An attention benchmark that looks at only one shape is not enough.

In LLM inference, prefill and decode behave like two different workloads.

Prefill-like

Prefill processes the entire prompt in a single pass.

Q: [B, H, T, Dh]
K: [B, H, T, Dh]
V: [B, H, T, Dh]
scores: [B, H, T, T]

In this workload, the cost of the score matrix and softmax dominates. As the sequence length T grows, the attention matrix grows as T x T, so doubling T quadruples the number of score entries.
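To make the T x T growth concrete, here is a minimal sketch that computes how many bytes the score tensor would take if it were fully materialized. The function name `score_matrix_bytes`, the shape values (B=1, H=32), and the 2-byte fp16 element size are assumptions chosen for illustration, not measurements from a real model.

```python
def score_matrix_bytes(B, H, Tq, Tk, bytes_per_elem=2):
    """Bytes needed to hold a scores tensor of shape [B, H, Tq, Tk].

    bytes_per_elem=2 assumes fp16/bf16 (an illustrative default).
    """
    return B * H * Tq * Tk * bytes_per_elem


# Hypothetical prefill shapes: B=1, H=32, query length == key length == T.
for T in (1024, 2048, 4096):
    mib = score_matrix_bytes(1, 32, T, T) / 2**20
    print(f"T={T}: {mib:.0f} MiB")
```

Under these assumptions the score tensor goes from 64 MiB at T=1024 to 1024 MiB at T=4096: a 4x longer prompt gives a 16x larger score matrix. This is why a benchmark at one sequence length says little about another.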

Decode-like

In decode, one new token attends to the existing KV cache.

Q: [B, H, 1, Dh]
K: [B, H, T, Dh]
V: [B, H, T, Dh]
scores: [B, H, 1, T]

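In this workload, KV cache reads and memory bandwidth become the important factors: every step reads all of K and V to produce a single output token. The problems that vLLM's KV cache manager and PagedAttention deal with also start from this point.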
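A rough way to see why decode leans on memory bandwidth is to compare the matmul FLOPs against the bytes of K and V that have to be read. The sketch below is a back-of-the-envelope model, not a profiler: `attention_cost` and `flops_per_kv_byte` are hypothetical helper names, softmax and Q/output traffic are ignored, K and V are assumed to be read exactly once, and the shape values are illustrative.

```python
def attention_cost(B, H, Tq, Tk, Dh, bytes_per_elem=2):
    """Rough FLOPs and K/V bytes read for one attention call.

    Counts only the two matmuls (Q @ K^T and P @ V); softmax is ignored.
    bytes_per_elem=2 assumes fp16/bf16.
    """
    # Each matmul does B*H*Tq*Tk*Dh multiply-adds, at 2 FLOPs per multiply-add.
    flops = 2 * (2 * B * H * Tq * Tk * Dh)
    # K and V are each [B, H, Tk, Dh] and assumed to be read once.
    kv_bytes = 2 * B * H * Tk * Dh * bytes_per_elem
    return flops, kv_bytes


def flops_per_kv_byte(B, H, Tq, Tk, Dh, bytes_per_elem=2):
    """Arithmetic intensity relative to K/V traffic only."""
    flops, kv_bytes = attention_cost(B, H, Tq, Tk, Dh, bytes_per_elem)
    return flops / kv_bytes


# Hypothetical shapes: B=1, H=32, T=4096, Dh=128.
prefill = flops_per_kv_byte(1, 32, 4096, 4096, 128)  # Q: [B, H, T, Dh]
decode = flops_per_kv_byte(1, 32, 1, 4096, 128)      # Q: [B, H, 1, Dh]
print(f"prefill: {prefill:.0f} FLOPs per KV byte")
print(f"decode:  {decode:.0f} FLOPs per KV byte")
```

In this simplified model the ratio works out to the query length for 2-byte elements, so prefill reuses each KV byte thousands of times while decode uses it about once. Both calls read the same K and V, but decode does far less compute per byte read, which is consistent with it being sensitive to KV cache layout and memory bandwidth.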