KV cache connection flow in vLLM

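In vLLM, the KV cache is not simply “a tensor stored inside the model.”

EngineCore manages KV blocks per request, and the GPU worker turns that block information into device tensors and kernel metadata and wires them into the model forward pass.

The overall flow looks like this.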

EngineCore / Scheduler
-> KVCacheManager.allocate_slots(...)
-> SchedulerOutput
-> GPUModelRunner
-> BlockTables
-> slot mapping / attention metadata
-> attention backend

This card is about pinning down who is responsible for what.

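What EngineCore does

At every step, the scheduler in EngineCore decides which requests advance and by how much.

For each request, roughly two values matter.

num_computed_tokens:
The number of tokens whose computation has finished and is already reflected in the KV cache

num_tokens_with_spec:
The number of tokens that are candidates for computation at this point, including any speculative tokens

The scheduler starts from the difference between the two to decide how many new tokens to compute in this step. The difference is the most it could compute; the actual count can be capped further, for example by the per-step token budget.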

num_new_tokens = num_tokens_with_spec - num_computed_tokens

Next, KV cache space has to be secured.

KVCacheManager.allocate_slots(request, num_new_tokens, ...)

If the blocks cannot be obtained here, the request cannot make progress in this step. In other words, the scheduler's admission, preemption, and waiting decisions are tied directly to free KV cache space.
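The arithmetic behind this decision can be sketched in a few lines. This is a minimal illustration, not vLLM's scheduler: ToyRequest, plan_new_tokens, extra_blocks_needed, token_budget, and block_size are hypothetical names, and the sketch assumes a request currently holds exactly enough blocks to cover its computed tokens.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    # Hypothetical stand-in for a request; only the two counters from the text.
    num_tokens_with_spec: int
    num_computed_tokens: int


def plan_new_tokens(req: ToyRequest, token_budget: int) -> int:
    """Tokens to compute this step: the gap, capped by an assumed per-step budget."""
    gap = req.num_tokens_with_spec - req.num_computed_tokens
    return max(0, min(gap, token_budget))


def extra_blocks_needed(req: ToyRequest, num_new_tokens: int, block_size: int) -> int:
    """Blocks to add so that computed + new tokens fit (assumes no spare blocks held)."""
    held = -(-req.num_computed_tokens // block_size)  # ceiling division
    required = -(-(req.num_computed_tokens + num_new_tokens) // block_size)
    return required - held


# A decode-like request: 96 tokens already cached, 100 known in total.
decode_req = ToyRequest(num_tokens_with_spec=100, num_computed_tokens=96)
decode_tokens = plan_new_tokens(decode_req, token_budget=8)
decode_blocks = extra_blocks_needed(decode_req, decode_tokens, block_size=16)

# A prefill-like request: nothing cached yet, so the budget is the limit.
prefill_req = ToyRequest(num_tokens_with_spec=1000, num_computed_tokens=0)
prefill_tokens = plan_new_tokens(prefill_req, token_budget=256)
prefill_blocks = extra_blocks_needed(prefill_req, prefill_tokens, block_size=16)
```

If the free pool holds fewer blocks than the second function returns, the request is the kind that has to wait or trigger preemption.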

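What KVCacheManager does

KVCacheManager is the KV cache allocator on the EngineCore side.

Simplified, its role is as follows.

1. Track the KV blocks a request already holds.
2. Work out whether new blocks are needed to store the new tokens.
3. Assign blocks from the free block pool.
4. Reclaim blocks when a request finishes or is preempted.
5. Build the list of block ids to hand to the worker.

In the vLLM implementation, objects such as KVCacheBlocks act as the interface between the scheduler and KVCacheManager. Rather than touching the allocator's complex internals directly, the scheduler receives “the block ids newly assigned in this step” as a result.

These block ids are the key metadata linking EngineCore and the GPU worker.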
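The five responsibilities above fit into a toy allocator. The sketch below is an assumption-laden illustration, not vLLM's KVCacheManager: ToyKVCacheManager and its method signatures are hypothetical, it returns plain lists of block ids instead of a KVCacheBlocks object, and it ignores prefix caching and reference counting.

```python
from collections import deque
from typing import Optional


class ToyKVCacheManager:
    """Hypothetical block allocator with a simple FIFO free pool."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = deque(range(num_blocks))  # free block pool
        self.req_blocks = {}  # request id -> block ids it holds

    def allocate_slots(
        self, req_id: str, num_computed_tokens: int, num_new_tokens: int
    ) -> Optional[list]:
        """Return the newly assigned block ids, or None if the pool is too small."""
        owned = self.req_blocks.get(req_id, [])
        total_tokens = num_computed_tokens + num_new_tokens
        needed = max(0, -(-total_tokens // self.block_size) - len(owned))
        if needed > len(self.free_blocks):
            return None  # caller must wait or preempt another request
        new_ids = [self.free_blocks.popleft() for _ in range(needed)]
        self.req_blocks[req_id] = owned + new_ids
        return new_ids

    def free(self, req_id: str) -> None:
        """Return all blocks of a finished or preempted request to the pool."""
        self.free_blocks.extend(self.req_blocks.pop(req_id, []))


manager = ToyKVCacheManager(num_blocks=4, block_size=16)
first = manager.allocate_slots("req-a", num_computed_tokens=0, num_new_tokens=20)
second = manager.allocate_slots("req-a", num_computed_tokens=20, num_new_tokens=1)
blocked = manager.allocate_slots("req-b", num_computed_tokens=0, num_new_tokens=40)
manager.free("req-a")
retry = manager.allocate_slots("req-b", num_computed_tokens=0, num_new_tokens=40)
```

Returning only the newly assigned ids mirrors the point in the text: the scheduler does not need the allocator's internals, only the delta it has to forward to the worker.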

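What SchedulerOutput hands over

EngineCore does not run the computation itself. Instead, it produces a scheduling result that the worker can execute.

From the KV cache point of view, the important pieces of information are these.

How many tokens each request computes in this step
Which KV block ids were newly assigned in this step
Whether there are new block ids that need to be initialized
If there are speculative decode tokens, which range they cover

This information is passed to the GPU worker in the form of a SchedulerOutput.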
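To make the shape of that hand-off concrete, here is a small data-structure sketch. The class and every field name in it are hypothetical and chosen to match the four items listed above; the real SchedulerOutput in vLLM has a different and larger set of fields.

```python
from dataclasses import dataclass, field


@dataclass
class ToySchedulerOutput:
    """Hypothetical, KV-cache-only view of one step's scheduling result."""

    # request id -> number of tokens to compute in this step
    num_scheduled_tokens: dict = field(default_factory=dict)
    # request id -> block ids newly assigned in this step
    new_block_ids: dict = field(default_factory=dict)
    # block ids the worker should initialize before use
    block_ids_to_zero: list = field(default_factory=list)
    # request id -> half-open [start, end) position range of speculative tokens
    spec_token_ranges: dict = field(default_factory=dict)

    @property
    def total_num_scheduled_tokens(self) -> int:
        return sum(self.num_scheduled_tokens.values())


step_output = ToySchedulerOutput(
    num_scheduled_tokens={"req-a": 4, "req-b": 4},
    new_block_ids={"req-a": [17]},
    block_ids_to_zero=[17],
    spec_token_ranges={"req-b": (129, 132)},  # 1 regular + 3 speculative tokens
)
```

Nothing in this structure is a tensor. It carries ids and counts only, which is what lets EngineCore stay on the policy side of the boundary.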

The important distinction here is the following.

EngineCore:
Manages policy and resource ownership.

GPU worker:
Prepares the actual device tensors and kernel inputs.

What the GPU worker does

The GPU worker runs the model forward pass. Before it can do so, it has to apply the block ids received from EngineCore to its own internal block table.

Conceptually, this is what happens.

for each scheduled request:
  find the request's worker-side row
  append the newly received block ids to BlockTables
  zero-fill the new blocks if needed

The worker maintains a block table per request.

request row 0: [17, 04, 91, ...]
request row 1: [33, 02, ...]
request row 2: [08, 77, 10, ...]

This table answers the question “in which physical KV cache blocks do this request's logical KV blocks live?”
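The worker-side bookkeeping can be sketched as a row lookup followed by an append. ToyBlockTables is a hypothetical class written for this card, not the BlockTables implementation in vLLM. It keeps rows as Python lists rather than device tensors and leaves out the zero-fill step.

```python
class ToyBlockTables:
    """Hypothetical per-request block table kept on the worker side."""

    def __init__(self) -> None:
        self.row_of = {}  # request id -> row index
        self.rows = []    # row index -> physical block ids, in logical order

    def append_blocks(self, req_id: str, new_block_ids: list) -> int:
        """Append newly assigned block ids to the request's row; return the row index."""
        row = self.row_of.get(req_id)
        if row is None:
            # First time this request is seen: give it a fresh row.
            row = len(self.rows)
            self.row_of[req_id] = row
            self.rows.append([])
        self.rows[row].extend(new_block_ids)
        return row


tables = ToyBlockTables()
row_a = tables.append_blocks("req-a", [17, 4])   # first step for req-a
row_b = tables.append_blocks("req-b", [33])      # first step for req-b
row_a_again = tables.append_blocks("req-a", [91])  # a later step adds one block
```

The list index within a row is the logical block index, and the stored value is the physical block id, so rows[0] ends up as [17, 4, 91] like row 0 in the example above.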

Why slot mapping is needed

The block table is a map at request granularity. The GPU kernel, however, has to write new K/V at token granularity.

So for every token processed in this step, the worker computes “which slot of the KV cache should this token's K/V be written to?”

token position -> logical block index + offset
logical block index -> physical block id
physical block id + offset -> KV cache slot

The result is the slot mapping.
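The three arrows above translate almost directly into code. This is a minimal sketch that assumes a flat slot numbering of physical_block_id * block_size + offset; compute_slot_mapping and block_size are illustrative names, and a real worker computes this on batched tensors rather than in a Python loop.

```python
def compute_slot_mapping(block_table_row: list, positions: list, block_size: int) -> list:
    """Map each token position to a flat KV cache slot index."""
    slots = []
    for pos in positions:
        # token position -> logical block index + offset
        logical_block, offset = divmod(pos, block_size)
        # logical block index -> physical block id
        physical_block = block_table_row[logical_block]
        # physical block id + offset -> KV cache slot (assumed flat layout)
        slots.append(physical_block * block_size + offset)
    return slots


# Row 0 from the block table example, with an assumed block size of 16.
row = [17, 4, 91]
slots = compute_slot_mapping(row, positions=[15, 16, 33], block_size=16)
# Position 15 is the last slot of block 17, position 16 the first slot of block 4.
```

Adjacent token positions can land in slots that are far apart, which is why the kernel needs an explicit per-token mapping instead of a base pointer plus stride.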

block table:
Translates a request's logical blocks into physical block ids.

slot mapping:
Translates each token of this step into its actual KV cache write location.

During decode, the new token's K/V is stored in this slot, and attention follows the block table to read the past K/V.

Metadata right before the forward pass

Before the model forward pass, the worker has to line up the following inputs.

input ids:
The tokens actually fed to the model in this step

positions:
The sequence position of each token

sequence lengths:
The current length of each request

block tables:
The per-request logical block -> physical block mapping

slot mappings:
The per-token KV cache write location for this step

Once this metadata reaches the attention backend, the kernel does not assume a single contiguous tensor. It uses the block table to read and write the KV cache.
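A toy version of that read and write path shows how the two pieces of metadata are used together. The sketch below is an assumption for illustration only: the cache is a list of Python lists rather than a GPU tensor, write_kv and gather_kv are hypothetical helpers, and slots use the flat numbering physical_block_id * block_size + offset.

```python
def write_kv(kv_cache: list, slot: int, value, block_size: int) -> None:
    """Write one token's K/V entry at the location named by the slot mapping."""
    block, offset = divmod(slot, block_size)
    kv_cache[block][offset] = value


def gather_kv(kv_cache: list, block_table_row: list, seq_len: int, block_size: int) -> list:
    """Read a request's K/V entries in sequence order by following its block table."""
    entries = []
    for pos in range(seq_len):
        logical_block, offset = divmod(pos, block_size)
        entries.append(kv_cache[block_table_row[logical_block]][offset])
    return entries


BLOCK_SIZE = 2
kv_cache = [[None] * BLOCK_SIZE for _ in range(4)]  # 4 physical blocks
block_table_row = [3, 1]  # logical block 0 -> physical 3, logical 1 -> physical 1

# Slot mapping for positions 0, 1, 2 of this request.
slot_mapping = [block_table_row[p // BLOCK_SIZE] * BLOCK_SIZE + p % BLOCK_SIZE for p in range(3)]
for pos, slot in enumerate(slot_mapping):
    write_kv(kv_cache, slot, f"kv{pos}", BLOCK_SIZE)

history = gather_kv(kv_cache, block_table_row, seq_len=3, block_size=BLOCK_SIZE)
```

Writes go through the slot mapping and reads go through the block table, yet both recover the same logical sequence even though the physical blocks are out of order.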

Summary

The KV cache connection flow in vLLM can be remembered like this.

EngineCore assigns blocks.
The GPU worker turns block ids into a block table and a slot mapping.
The attention backend uses that metadata to read and write the KV cache.

So the answer to “does EngineCore manage the KV cache, or does the GPU worker?” is both.

Their responsibilities differ, though.

EngineCore:
Decides which KV blocks go to which request.

GPU worker:
Turns that decision into the actual GPU tensor layout and kernel inputs.

Code reading map

When reading the vLLM code, this order works well.

vllm/v1/core/sched/scheduler.py
  the schedule step and the KV block allocation requests

vllm/v1/core/kv_cache_manager.py
  per-request KV block allocation/free

vllm/v1/worker/gpu/model_runner.py
  receives SchedulerOutput and updates worker-side state

vllm/v1/worker/gpu/block_table.py
  block table, staged writes, slot mapping preparation

With this flow in place, speculative decoding and prefix caching can be read not as “new features” but as “what constraints do they add to this block allocation and worker metadata flow?”

Related

inference-engine-layers: the basic map that splits the system into API Server, EngineCore, and GPU worker
kv-cache: a resource managed jointly by EngineCore and the worker
paged-attention: why a block table is needed

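Check

What does EngineCore do directly with respect to the KV cache?
Why does the GPU worker convert the block ids it receives from EngineCore into a block table and a slot mapping instead of passing them to the kernel as they are?
What is the difference between a block table and a slot mapping?
Why is it inaccurate to say that KV cache management in vLLM is “EngineCore's job alone” or “the GPU worker's job alone”?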