Weights, Activations, KV Cache Quantization

Quantization is not a single switch.

In LLM inference, there are three main things you can move to a lower bit-width.

W: weights
A: activations
KV cache: past keys and values

Which tensor you lower determines what you gain and what risk you take on.

Weights

Frozen after training

Primary gain: model size + bandwidth
Main risk: accuracy loss, usually lower than for the other two

Activations

Fresh every forward pass

Primary gain: compute throughput
Main risk: outliers + calibration

KV cache

Written once, read many times

Primary gain: long context + concurrency
Main risk: attention quality

Weight quantization is the usual first move. Activation quantization matters when the runtime can execute low-precision compute. KV cache quantization matters most when context length or concurrency makes cache memory the bottleneck.

Weight quantization

Weights are fixed once training ends. That means you can inspect their distribution offline, before inference, and choose the scales.

frozen weights
-> offline analysis / calibration
-> smaller checkpoint
-> fewer bytes loaded per forward pass

This is why weight-only quantization is the most common starting point. Notation such as W8A16 or W4A16 means the weights are stored in 8-bit or 4-bit, while the activations and the actual accumulation use higher precision.

The key point is that weight-only quantization mainly reduces memory footprint and bandwidth. If the kernel dequantizes the weights and then runs an FP16 matmul, the compute itself is not running in low precision.
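The weight-only path described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production kernel: the function names, the symmetric per-row INT8 scheme, and the tiny example matrix are all hypothetical choices made for readability.

```python
def quantize_row_int8(row):
    """Symmetric per-row INT8: one float scale per output channel (assumed scheme)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def weight_only_linear(x, q_weights, scales):
    """Compute y = W x where W is stored as INT8 but dequantized before the matmul."""
    y = []
    for q_row, scale in zip(q_weights, scales):
        w_row = [q * scale for q in q_row]                   # dequantize back to float
        y.append(sum(xi * wi for xi, wi in zip(x, w_row)))   # float multiply-accumulate
    return y


# Offline step: weights are frozen, so scales can be computed once, before inference.
W = [[0.50, -1.00, 0.25],
     [0.02, 0.01, -0.04]]
packed = [quantize_row_int8(row) for row in W]
q_weights = [q for q, _ in packed]
scales = [s for _, s in packed]

# Inference step: the stored weights are small, but the arithmetic is still float.
x = [1.0, 2.0, 3.0]
y_ref = [sum(xi * wi for xi, wi in zip(x, row)) for row in W]
y_q = weight_only_linear(x, q_weights, scales)
print(y_ref, y_q)
```

The storage shrinks because each weight becomes a small integer plus a shared scale, yet the multiply-accumulate inside `weight_only_linear` runs on floats. That matches the W8A16 reading: fewer bytes to load, same compute precision.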

Activation quantization

Activations are produced fresh for every input. Even for the same layer, the range can change with the prompt or the batch.

input-dependent range
outlier channels
calibration mismatch
runtime scale choice
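To make the outlier item in the list above concrete, here is a small sketch in plain Python. It assumes a symmetric INT8 scheme with a single scale shared by all values; the helper name and the numbers are hypothetical and chosen only to show the effect.

```python
def quantize_dequantize_int8(values):
    """Round-trip values through symmetric INT8 with one shared scale (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


normal = [0.10, -0.20, 0.15, 0.05]
with_outlier = normal + [60.0]   # one large channel shares the same scale

recon_normal = quantize_dequantize_int8(normal)
recon_outlier = quantize_dequantize_int8(with_outlier)[:4]

print(recon_normal)    # close to the original values
print(recon_outlier)   # the small values collapse to zero
```

With the outlier present, the shared scale is set by 60.0, so every small value falls below half a quantization step and rounds to zero. Finer granularity (for example one scale per token or per channel) is one way to limit this.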

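That is why activation quantization is hard. When it works, though, the payoff is large. If both weights and activations are low precision, as in W8A8 or FP8 W8A8, the GEMM itself can run on a low-precision kernel.

So the core question for activation quantization is this:

Do you only want to reduce memory?
Or do you also want the matmul compute to run in low precision?

If you want the latter, the hardware and runtime must support a kernel for that format.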
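The difference between the two answers can be sketched as a W8A8 linear layer in plain Python. This is an illustration under assumptions, not a real kernel: the function names, the symmetric INT8 scheme, and the dynamic per-token activation scale are hypothetical choices, and real speedups depend on hardware integer or FP8 GEMM support.

```python
def quantize_int8(values):
    """Symmetric INT8 with one scale for the whole vector (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def w8a8_linear(x, q_weights, w_scales):
    """Integer multiply-accumulate, then a single float rescale per output."""
    q_x, x_scale = quantize_int8(x)            # dynamic: scale computed from this token
    y = []
    for q_row, w_scale in zip(q_weights, w_scales):
        acc = sum(a * b for a, b in zip(q_x, q_row))   # integer accumulation
        y.append(acc * x_scale * w_scale)              # rescale back to float once
    return y


W = [[0.50, -1.00, 0.25],
     [0.02, 0.01, -0.04]]
packed = [quantize_int8(row) for row in W]     # offline, one scale per output row
q_weights = [q for q, _ in packed]
w_scales = [s for _, s in packed]

x = [1.0, 2.0, 3.0]
y_ref = [sum(xi * wi for xi, wi in zip(x, row)) for row in W]
y_q = w8a8_linear(x, q_weights, w_scales)
print(y_ref, y_q)
```

Unlike a weight-only layer, the inner loop here multiplies integers by integers. That is the part a low-precision kernel accelerates, and it is only possible because the activations were quantized too.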

KV cache quantization

The KV cache sits between the two: it is neither a frozen weight nor a throwaway activation.

K/V are written once when a token arrives
Every later decode step reads the past K/V again
The cache grows with context length and batch size

KV cache quantization is therefore mainly about long context and serving concurrency.

A smaller KV cache lets you fit more prefixes, longer contexts, and more requests in GPU memory. But if the attention computation still runs in high precision, latency may not drop much. In that case the gain is closer to memory capacity and throughput headroom than to a compute speedup.
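A quick size estimate shows why cache memory becomes the bottleneck. The sketch below assumes a standard decoder layout where each layer stores one key and one value vector per KV head per token; the model shape (32 layers, 8 KV heads, head dimension 128) and the workload are hypothetical, and the formula ignores any extra storage for quantization scales.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Total KV cache size: 2 tensors (K and V) per layer, per head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


GIB = 2 ** 30

# Hypothetical model shape and serving workload.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
workload = dict(seq_len=32768, batch_size=8)

fp16_bytes = kv_cache_bytes(**shape, **workload, bytes_per_elem=2)
fp8_bytes = kv_cache_bytes(**shape, **workload, bytes_per_elem=1)

print(fp16_bytes / GIB)   # 32.0
print(fp8_bytes / GIB)    # 16.0
```

Under these assumptions, moving the cache from 16-bit to 8-bit frees 16 GiB, which can go to longer contexts or more concurrent requests. Nothing in this arithmetic says the attention kernel runs faster.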

The same bit-width means different things for different targets

Just saying INT4 is not enough. You have to ask what is INT4.

W4A16: weight-only compression
W8A8: weight and activation low-precision compute
FP8 KV cache: cache capacity and long-context pressure

Without this distinction, you get confusion like “I quantized it, so why didn't it get faster?”

What to look at next

The three targets lead to different methods later on.

weights     -> GPTQ, AWQ, group-wise INT4
activations -> SmoothQuant, dynamic/per-token quantization, FP8
KV cache    -> separate granularity for keys and values, quantized attention kernels

This path first separates the targets, then looks at the granularity and methods that fit each one.

Check your understanding

Why is weight-only quantization the easiest thing to try first?
Why is activation quantization more directly tied to compute speedup?
Why does KV cache quantization become important in long-context serving?