nano-vLLM Sampling Kernel Candidate

Last modified: June 30, 2026

inference, nano-vllm, sampling, kernel

Sampling looks small next to attention, but in serving it is called often.

In the decode phase, logits processing and token selection run once for every generated token.

logits
  -> temperature
  -> top-k / top-p
  -> sample or argmax
  -> next token
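The pipeline above can be sketched as a plain-Python reference sampler. This is a minimal illustration of the semantics, not nano-vLLM's implementation: the function name `sample_next_token` and its parameters are hypothetical, and a real serving path would operate on batched GPU tensors rather than Python lists.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Hypothetical reference sampler: temperature -> top-k -> top-p -> sample."""
    if temperature <= 0.0:
        # Assumption: a non-positive temperature means greedy argmax.
        return max(range(len(logits)), key=lambda i: logits[i])

    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]

    # Candidate token ids, highest scaled logit first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Numerically stable softmax over the remaining candidates.
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    if top_p < 1.0:
        cumulative, keep = 0.0, 0
        for p in probs:
            cumulative += p
            keep += 1
            if cumulative >= top_p:
                break
        order, probs = order[:keep], probs[:keep]

    # random.choices renormalizes the weights internally.
    return rng.choices(order, weights=probs, k=1)[0]
```

A reference like this is mainly useful as a correctness oracle: any fused kernel considered later can be compared against it on small inputs.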

At first, greedy argmax is enough. But once top-k/top-p, repetition penalty, and structured decoding are added, the sampling path can become hard to ignore.
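As one example of the extra per-token work, here is a sketch of a repetition penalty in a commonly used formulation: positive logits of already generated tokens are divided by the penalty and non-positive ones are multiplied by it. The source does not say which variant nano-vLLM uses, so the function name `apply_repetition_penalty` and the default value are assumptions.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Hypothetical helper: penalize tokens that already appear in the output."""
    out = list(logits)  # copy so the caller's logits stay untouched
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty  # shrink positive logits toward zero
        else:
            out[tok] *= penalty  # push non-positive logits further down
    return out


print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1, 1], penalty=2.0))
# prints [1.0, -2.0, 0.5]
```

Each such step touches the logits again and depends on the token history, which is why the cost can add up across decode steps.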

This card is for confirming with a profiler whether sampling is actually a bottleneck, and only then deciding whether to promote it to a kernel candidate.
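A rough first pass at that decision could measure what share of each decode step goes to sampling. The sketch below uses a hypothetical `StageTimer` helper built on wall-clock time; the stage names and the `time.sleep` calls are placeholders for the real forward pass and sampler. On a GPU, kernel launches are asynchronous, so wall-clock numbers are only meaningful after a device synchronization or when taken from a real profiler.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Hypothetical helper: accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # On a GPU, synchronize before this point, or the elapsed time
            # only covers the kernel launch, not the kernel itself.
            self.totals[name] += time.perf_counter() - start

    def share(self, name):
        """Fraction of all recorded time spent in the given stage."""
        total = sum(self.totals.values())
        return self.totals[name] / total if total > 0 else 0.0


timer = StageTimer()
for _ in range(3):  # stand-in for three decode steps
    with timer.stage("forward"):
        time.sleep(0.002)  # placeholder for the model forward pass
    with timer.stage("sampling"):
        time.sleep(0.001)  # placeholder for logits processing + sampling
print(f"sampling share: {timer.share('sampling'):.1%}")
```

If the sampling share stays small across realistic batch sizes and sampling settings, the card can be closed without writing a kernel.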

Check