nano-vLLM Attention Kernel Swap

Last modified: June 30, 2026

inference, nano-vllm, attention, kernel

The attention kernel swap is the central experiment in optimizing nano-vLLM.

Three implementations are compared:

1. manual attention
2. PyTorch SDPA
3. optimized attention backend
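The first two candidates can be sketched as interchangeable functions. This is a minimal sketch, not nano-vLLM's actual code: the names `manual_attention`, `sdpa_attention`, and `ATTENTION_BACKENDS` are hypothetical, and the tensor layout `(batch, heads, seq, head_dim)` is an assumption. The third candidate is left out because it depends on which optimized library is installed.

```python
import math

import torch
import torch.nn.functional as F


def manual_attention(q, k, v, causal=True):
    """Reference attention: softmax(QK^T / sqrt(d)) V, written out by hand.

    Assumed layout (not taken from nano-vLLM): (batch, heads, seq, head_dim).
    """
    scale = 1.0 / math.sqrt(q.size(-1))
    scores = (q @ k.transpose(-2, -1)) * scale
    if causal:
        q_len, k_len = q.size(-2), k.size(-2)
        # Lower-triangular mask: position i may only attend to positions <= i.
        mask = torch.ones(q_len, k_len, dtype=torch.bool, device=q.device).tril()
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


def sdpa_attention(q, k, v, causal=True):
    """Same math, delegated to PyTorch's fused SDPA entry point."""
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)


# Hypothetical registry: the swap point is a single lookup by name.
ATTENTION_BACKENDS = {
    "manual": manual_attention,
    "sdpa": sdpa_attention,
}
```

Before timing anything, it is worth checking that the backends agree numerically on the same inputs. In this sketch, `causal=True` is meant for prefill, where query and key lengths match; a single-token decode step would pass `causal=False`, since the new query may attend to every cached key.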

Always measure prefill and decode separately.

prefill:
Attention over the entire prompt; affects TTFT (time to first token).

decode:
Dominated by KV cache reads; affects TPOT (time per output token).
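The split measurement can be sketched as a small timing harness. It is a sketch under stated assumptions: `measure_phases`, `prefill_fn`, and `decode_step_fn` are hypothetical names, with `prefill_fn` assumed to run the prompt forward pass through to the first token and `decode_step_fn` assumed to produce exactly one further token.

```python
import time
from typing import Callable, Dict


def measure_phases(
    prefill_fn: Callable[[], object],
    decode_step_fn: Callable[[], object],
    num_decode_steps: int,
) -> Dict[str, float]:
    """Time prefill and decode separately and report TTFT and TPOT in seconds.

    Both callables are hypothetical hooks into the engine under test.
    On a GPU they should synchronize the device before returning
    (for example with torch.cuda.synchronize()); otherwise the timer only
    captures kernel launch time, not execution time.
    """
    if num_decode_steps < 1:
        raise ValueError("num_decode_steps must be at least 1")

    # Prefill: one pass over the whole prompt, ending at the first token.
    start = time.perf_counter()
    prefill_fn()
    ttft = time.perf_counter() - start

    # Decode: one token per step, each step reading the KV cache.
    start = time.perf_counter()
    for _ in range(num_decode_steps):
        decode_step_fn()
    decode_total = time.perf_counter() - start

    return {"ttft_s": ttft, "tpot_s": decode_total / num_decode_steps}
```

Running the same harness once per attention backend keeps the two numbers comparable: a backend that helps TTFT on long prompts will not necessarily help TPOT, and a single end-to-end latency figure would hide that.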

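This experiment is a preliminary step toward understanding vLLM's PagedAttention. Only after working through the attention swap point in a small engine do the responsibility boundaries become visible when reading the production vLLM backend.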
