Autoregressive vs Diffusion Decoding

Autoregressive decoding and diffusion decoding differ in the basic unit of work their generation loops perform.

Autoregressive decoding

A typical decoder-only LLM builds its output from left to right.

step 1: generate token A
step 2: generate token B
step 3: generate token C

At every step, one new token is appended to the end of the request's token sequence.

tokens += [sampled_token]

With this structure, the KV cache also grows by a natural append.

The K/V of the new token is appended to the end of the cache.
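The append-only loop described above can be sketched as follows. This is a minimal illustration, not a real model: `toy_kv` is a hypothetical stand-in for the per-token key/value projection, and the random draw stands in for a forward pass plus sampling.

```python
import random

def toy_kv(token):
    # Hypothetical stand-in for computing a token's key/value pair.
    return (token * 2, token * 3)

def autoregressive_decode(prompt, steps, vocab_size=100, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    # Prefill: one K/V entry per prompt token.
    kv_cache = [toy_kv(t) for t in prompt]
    for _ in range(steps):
        # Placeholder for "run the model and sample the next token".
        sampled_token = rng.randrange(vocab_size)
        tokens += [sampled_token]               # the sequence grows by one token
        kv_cache.append(toy_kv(sampled_token))  # the cache grows by one entry
    return tokens, kv_cache

tokens, kv_cache = autoregressive_decode([1, 2, 3], steps=4)
```

Both `tokens` and `kv_cache` only ever grow at the tail, so their lengths stay equal throughout the loop.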

Diffusion decoding

A dLLM (diffusion LLM) instead lays out a fixed-length canvas and repeatedly updates multiple positions on it.

canvas step 1: [?, a, ?, ?, b]
canvas step 2: [c, a, ?, d, b]
canvas step 3: [c, a, e, d, b]
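The trace above can be reproduced with a small sketch. The `schedule` here is hand-written purely to match the trace; in a real dLLM the positions and tokens updated at each step would come from the model's predictions, and `update_canvas` is a hypothetical helper name.

```python
MASK = "?"

def update_canvas(canvas, updates):
    """Return a new canvas with the given {position: token} updates applied."""
    new_canvas = list(canvas)
    for pos, tok in updates.items():
        new_canvas[pos] = tok
    return new_canvas

canvas = [MASK] * 5  # fixed length for the whole generation
# Hypothetical update schedule chosen to reproduce the three steps above.
schedule = [
    {1: "a", 4: "b"},
    {0: "c", 3: "d"},
    {2: "e"},
]

history = []
for updates in schedule:
    canvas = update_canvas(canvas, updates)
    history.append(canvas)
    print(canvas)
```

The canvas length never changes; only the contents of its positions do.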

The key difference is that a single step here is not "append a new token."

autoregressive:
append one token

diffusion:
update canvas positions

An autoregressive serving loop revolves around the following questions.

Which requests will produce a next token in this step?
Where should the generated token be appended?
How should the KV cache grow?
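As a rough sketch, one step of such a loop might answer the three questions like this. The `Request` class, `serving_step`, the `max_batch` limit, and the `sample` callback are all hypothetical names for illustration; real serving systems use far more elaborate scheduling and cache management.

```python
from dataclasses import dataclass

@dataclass
class Request:
    tokens: list
    max_new_tokens: int
    generated: int = 0
    kv_len: int = 0

    def __post_init__(self):
        # Assume the prompt has already been prefilled into the cache.
        self.kv_len = len(self.tokens)

    @property
    def finished(self):
        return self.generated >= self.max_new_tokens

def serving_step(requests, max_batch, sample):
    # Q1: which requests produce a next token in this step?
    batch = [r for r in requests if not r.finished][:max_batch]
    for r in batch:
        tok = sample(r.tokens)  # placeholder for the model forward pass
        r.tokens.append(tok)    # Q2: append at the tail of the request's own sequence
        r.kv_len += 1           # Q3: the cache grows by one entry per new token
        r.generated += 1
    return batch
```

Every request in the batch advances by exactly one token per step, which is why scheduling can be expressed purely in terms of who runs next.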

For a dLLM serving loop, the questions change.

Is this step a denoise step or a commit step?
How many canvas positions should be re-evaluated?
Which tokens should be finalized and emitted to the user?
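One way to make these questions concrete is the sketch below. It rests on loud assumptions: `CanvasRequest`, `denoise`, and `commit` are hypothetical names, the predictions are passed in by hand rather than produced by a model, and the commit policy (finalize the contiguous filled prefix) is just one simple choice, not how any particular dLLM system necessarily works.

```python
from dataclasses import dataclass, field

MASK = None

@dataclass
class CanvasRequest:
    canvas: list
    committed: int = 0  # count of leading positions already sent to the user
    emitted: list = field(default_factory=list)

def denoise(req, predictions, num_positions):
    """Re-evaluate up to num_positions uncommitted positions ({pos: token})."""
    updated = 0
    for pos in sorted(predictions):
        if updated == num_positions:
            break
        if pos >= req.committed:  # committed positions are frozen
            req.canvas[pos] = predictions[pos]
            updated += 1
    return updated

def commit(req):
    """Assumed policy: finalize the contiguous filled prefix and emit it."""
    end = req.committed
    while end < len(req.canvas) and req.canvas[end] is not MASK:
        end += 1
    new_tokens = req.canvas[req.committed:end]
    req.emitted.extend(new_tokens)
    req.committed = end
    return new_tokens

req = CanvasRequest(canvas=[MASK] * 5)
denoise(req, {1: "a", 4: "b"}, num_positions=2)
first = commit(req)    # nothing emitted yet: position 0 is still masked
denoise(req, {0: "c", 3: "d"}, num_positions=2)
second = commit(req)   # emits the prefix up to the first remaining mask
denoise(req, {2: "e"}, num_positions=1)
third = commit(req)    # emits the rest of the canvas
```

Under this policy a denoise step may change several positions without emitting anything, while a commit step emits zero or more tokens at once, so user-visible progress is decoupled from the step count.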

This difference is what makes dLLMs interesting from a serving-system perspective.