vLLM GPU Worker Execution Contract

Last modified: June 30, 2026

inference  vllm  gpu-worker  model-runner

The GPU worker turns the engine core's decisions into actual GPU execution.

The work on the worker side falls into four broad stages.

1. Interpret the scheduler output
2. Prepare input ids / positions / sequence lengths
3. Prepare block table / slot mapping / attention metadata
4. Run the model forward pass and sampling
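
Stages 1 to 3 can be sketched in a few lines. This is a minimal illustration, not vLLM's actual code: `ScheduledRequest`, `build_worker_inputs`, and every field name are hypothetical, and plain Python lists stand in for the torch tensors a real worker would build. It assumes a paged KV cache where each block holds `block_size` token slots.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScheduledRequest:
    """Hypothetical engine-core view of one scheduled request."""
    request_id: str
    new_token_ids: List[int]   # tokens to run in this step
    num_computed_tokens: int   # tokens already present in the KV cache
    block_ids: List[int]       # KV cache blocks assigned to this request


def build_worker_inputs(
    requests: List[ScheduledRequest], block_size: int
) -> Dict[str, List[int]]:
    """Flatten per-request scheduling data into batch-level arrays."""
    input_ids: List[int] = []
    positions: List[int] = []
    slot_mapping: List[int] = []
    seq_lens: List[int] = []
    for req in requests:
        start = req.num_computed_tokens
        for offset, token in enumerate(req.new_token_ids):
            pos = start + offset                       # absolute position in the sequence
            block = req.block_ids[pos // block_size]   # physical block holding this position
            input_ids.append(token)
            positions.append(pos)
            # Flat slot index where this token's key/value will be written.
            slot_mapping.append(block * block_size + pos % block_size)
        # Total sequence length after this step (cached + new tokens).
        seq_lens.append(start + len(req.new_token_ids))
    return {
        "input_ids": input_ids,
        "positions": positions,
        "slot_mapping": slot_mapping,
        "seq_lens": seq_lens,
    }


batch = build_worker_inputs(
    [
        ScheduledRequest("a", [11, 12, 13], 0, [7]),   # prefill of 3 tokens
        ScheduledRequest("b", [21], 5, [2, 5]),        # decode of 1 token
    ],
    block_size=4,
)
print(batch)
```

In this sketch the request ids disappear from the output: after flattening, the model only sees token ids, positions, slots, and per-sequence lengths.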

The data format changes at this boundary.

Engine core level:
request id, token count, block id

Worker level:
torch tensor, device tensor, attention metadata, sampling tensor
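
One concrete case of this format change is the block table. The sketch below is an assumption-laden illustration, not vLLM's implementation: `pad_block_table` and `pad_id` are hypothetical names, and nested lists stand in for a 2D device tensor. It shows why per-request block id lists need reshaping: they are ragged, while a kernel indexing by (request, block index) needs a rectangular layout.

```python
from typing import List


def pad_block_table(
    block_ids_per_request: List[List[int]], pad_id: int = 0
) -> List[List[int]]:
    """Pad ragged per-request block id lists into a rectangular table."""
    max_blocks = max((len(ids) for ids in block_ids_per_request), default=0)
    # Padding entries are placeholders; a consumer is expected to use the
    # per-request sequence length to know which entries are valid.
    return [ids + [pad_id] * (max_blocks - len(ids)) for ids in block_ids_per_request]


table = pad_block_table([[7], [2, 5], [1, 3, 4]])
print(table)
```

The padded entries carry no meaning on their own, which is one reason sequence lengths have to travel alongside the block table.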

When reading the GPU worker, ask "what tensor contract does the engine core's abstract scheduling result turn into?" before asking "which kernel does it use?"

Check

Why does the worker need to prepare positions and sequence lengths?
Why is it hard for an attention kernel to use a list of block ids directly?
At the worker level, where does request data turn into tensor data?
