Post-Training Quantization and Calibration

Post-training quantization (PTQ) converts a model to lower precision after training has finished.

trained model
-> observe weights / activations
-> choose scale and zero-point
-> export quantized model
-> evaluate quality and serving behavior

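The key point is that the model is not retrained from scratch. That is why PTQ is usually the first thing tried in practice: it can be applied with nothing more than the model weights, and there is no need to reopen the fine-tuning or pretraining pipeline.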

Static vs. dynamic quantization MSE by input scenario (ratio = static MSE / dynamic MSE):

Scenario            Static MSE   Dynamic MSE   Ratio    Note
Calibration match   0.000227     0.000207      1.1x     similar
Narrow input        0.000192     0.0000170     11.3x    dynamic uses range
Wide input          1.80         0.00160       1127x    static clips
Outlier-heavy       1.26         0.00494       256x     static range breaks

Static calibration is cheap when production data matches calibration data. When activation ranges shift, dynamic quantization can avoid wasted range or catastrophic clipping.

Calibration is the job of choosing the range

The quantization formula itself is simple.

q = round(r / S + Z)
r ~= S(q - Z)
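The two formulas can be sketched directly in code. This is a minimal illustration, not a reference implementation: the signed 8-bit range (`qmin=-128`, `qmax=127`), the clamp step, and the sample values of `scale` and `zero_point` are assumptions added here, since the formula above does not fix a bit-width.

```python
def quantize(r, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to an integer code: q = round(r / S + Z), then clamp."""
    q = round(r / scale + zero_point)
    # Assumption: codes outside the integer range are saturated (clipped).
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover an approximation of the real value: r ~= S * (q - Z)."""
    return scale * (q - zero_point)


scale, zero_point = 0.05, 0  # illustrative values, not from the text
for r in (0.12, -1.0, 10.0):
    q = quantize(r, scale, zero_point)
    print(r, q, dequantize(q, scale, zero_point))
```

The last value, 10.0, lies outside the representable range of about ±6.35 for this scale, so it saturates at the largest code. That is the clipping behavior the rest of this section is concerned with.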

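The hard question in practice is this:

Over what data range should S and Z be chosen?

Calibration runs representative data through the model, collects statistics such as the min, max, percentiles, or histograms of the weights and activations, and derives the scale and zero-point from those statistics.

Weights are already fixed, so they are comparatively easy. Activations change with every input, which is why representative data matters.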
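As a rough sketch of the simplest variant, the function below derives S and Z from the observed min and max over a set of calibration batches. The function name `calibrate_minmax`, the signed 8-bit range, and the sample batches are assumptions for illustration; percentile or histogram-based calibration would replace the min/max step with a different statistic.

```python
def calibrate_minmax(batches, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from the min/max seen across calibration batches."""
    lo = min(min(batch) for batch in batches)
    hi = max(max(batch) for batch in batches)
    # Include 0.0 in the range so that real zero maps exactly to an integer code.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero input; any positive scale works
    # Solve q = r / S + Z for the case r = lo, q = qmin.
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


# Hypothetical activation batches collected during calibration.
batches = [[-1.0, 0.5], [0.2, 3.0]]
scale, zero_point = calibrate_minmax(batches)
print(scale, zero_point)
```

Everything downstream depends on how well `batches` resembles real inputs: the resulting range is only as representative as the data used to collect it.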

Calibration chooses between two kinds of error

A wide quantization grid clips fewer large values.

wide range
-> less clipping
-> larger step size
-> more granular error

Conversely, a tight grid represents the bulk of small values more precisely.

tight range
-> smaller step size
-> better resolution near the bulk
-> more overload error for outliers

So calibration is not about eliminating error. It is about choosing which error to accept.

granular error: small error from rounding inside the grid
overload error: large error from clipping values that fall outside the grid

In LLM activations this trade-off is sharper, because most values are small while a few channels or tokens can produce very large ones.
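The trade-off can be made concrete by splitting the squared error into its granular and overload parts. The sketch below assumes symmetric 8-bit quantization and a toy distribution (1000 unit-Gaussian values plus three outliers at ±20); both are illustrative choices, not measurements from a real model.

```python
import random


def error_split(values, clip, qmax=127):
    """Symmetric quantization to [-clip, clip]; return (granular_mse, overload_mse)."""
    scale = clip / qmax
    granular = overload = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        err = (v - q * scale) ** 2
        if abs(v) > clip:
            overload += err   # value fell outside the grid and was clipped
        else:
            granular += err   # value was inside the grid; only rounding error
    n = len(values)
    return granular / n, overload / n


rng = random.Random(0)
values = [rng.gauss(0.0, 1.0) for _ in range(1000)] + [20.0, -20.0, 20.0]

wide = error_split(values, clip=20.0)   # covers the outliers, coarse step
tight = error_split(values, clip=4.0)   # fine step, outliers are clipped
print("wide  (granular, overload):", wide)
print("tight (granular, overload):", tight)
```

The wide range has no overload error but a larger granular error, while the tight range reverses that. Which one gives the lower total depends on how many outliers there are and how much they matter to the model.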

Static quantization fixes the scale ahead of time

Static quantization runs calibration data through the model before deployment and freezes the scale.

calibration dataset
-> collect activation ranges
-> freeze scale / zero-point
-> inference uses fixed parameters

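The advantages are low runtime overhead and a predictable deployment graph. It is also a form that fits runtimes such as TensorRT, ONNX Runtime, and TFLite well.

The problem appears when calibration data and production traffic differ. For example, if the activation max was around 8 during calibration but a real request produces 25, the fixed scale can clip the large values.

This failure is not necessarily gradual. If an attention logit or a particular hidden channel is clipped, the result may be not a small numerical error but a change in the attention pattern itself.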
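The 8-versus-25 example can be reproduced with a few lines. This sketch assumes symmetric 8-bit quantization with the range frozen at ±8; the `request` values are hypothetical production activations chosen to include two values beyond the calibrated range.

```python
def quantize_dequantize(values, absmax, qmax=127):
    """Symmetric int8-style round trip with a fixed range [-absmax, absmax]."""
    scale = absmax / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


calib_absmax = 8.0                        # range frozen during calibration
request = [0.5, -3.0, 7.9, 25.0, -18.0]   # hypothetical production activations

static_out = quantize_dequantize(request, calib_absmax)
print(static_out)                # 25.0 and -18.0 come back as roughly +8 and -8
print(mse(request, static_out))  # dominated by the two clipped values
```

The values inside the calibrated range survive with only rounding error. The two values outside it are replaced by the range limit, and they account for nearly all of the error.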

Dynamic quantization computes the scale at runtime

Dynamic quantization looks at the range of the current input during inference and computes the scale from it.

current input
-> observe current activation range
-> compute scale now
-> quantize / dequantize

This approach is robust when the input distribution shifts. For a narrow input it uses a tight grid and gains resolution; for a wide input it enlarges the scale and avoids clipping.

The price is:

runtime scale computation
scale metadata movement
kernel support requirement

This is also why dynamic per-token quantization shows up so often for Transformer activations. Each token has a different range, and covering the whole sequence with a single scale can waste precision on most tokens.
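A small sketch of the per-token idea, under the assumption of symmetric 8-bit quantization and two made-up tokens with very different ranges. Both variants compute their scale from the current input at runtime; the only difference is whether one scale covers the whole sequence or each token gets its own.

```python
def quant_dequant(values, absmax, qmax=127):
    """Symmetric round trip; absmax is the range the scale is derived from."""
    scale = (absmax or 1.0) / qmax   # guard against an all-zero input
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Two hypothetical tokens: one with tiny activations, one with large ones.
tokens = [[0.01, -0.02, 0.03, 0.015],
          [5.0, -12.0, 8.0, 25.0]]

# One scale for the whole sequence, computed from the current input.
tensor_absmax = max(abs(v) for tok in tokens for v in tok)
per_tensor = [quant_dequant(tok, tensor_absmax) for tok in tokens]

# One scale per token, also computed at runtime.
per_token = [quant_dequant(tok, max(abs(v) for v in tok)) for tok in tokens]

for i, tok in enumerate(tokens):
    print(i, mse(tok, per_tensor[i]), mse(tok, per_token[i]))
```

With a shared scale, the small token rounds entirely to zero because its values are below half a step. With its own scale it keeps its structure, at the cost of computing and carrying one extra scale per token.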

PTQ and QAT sit at different points in the pipeline

PTQ is applied after training.

training finished
-> quantize
-> evaluate

QAT (quantization-aware training) simulates the effect of quantization during training or fine-tuning.

training / fine-tuning
-> fake quantization
-> model adapts to quantization noise
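Fake quantization is a quantize step followed immediately by a dequantize step, so the tensor stays in floating point but carries the rounding noise it would see after real quantization. The sketch below shows only that forward computation, with symmetric per-tensor scaling and made-up weights as assumptions. Training frameworks typically also pass gradients through the rounding step unchanged (a straight-through estimator), which is omitted here.

```python
def fake_quantize(weights, bits):
    """Quantize then dequantize, returning floats that carry quantization noise."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(w) for w in weights) or 1.0   # guard against all zeros
    scale = absmax / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


weights = [0.31, -0.74, 0.05, 0.92, -0.48]   # hypothetical weights
for bits in (8, 4):
    noisy = fake_quantize(weights, bits)
    print(bits, noisy, mse(weights, noisy))
```

The noise grows quickly as the bit-width drops, which is the regime where letting the model adapt to it during training becomes attractive.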

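QAT can have the edge at very low bit-widths or under strict accuracy requirements. But it needs a training pipeline and costs far more.

So on the inference quantization path, PTQ is the default starting point. QAT is best understood as the more expensive option to consider when PTQ cannot reach the required quality.

QLoRA is out of scope here. It uses quantization to reduce fine-tuning memory and belongs under PEFT.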

PTQ is not finished without evaluation

PTQ is not just a conversion; it is a workflow that includes validation.

1. Prepare representative calibration data
2. Choose a quantization config
3. Generate the quantized checkpoint
4. Evaluate task quality
5. Check serving benchmarks
6. Adjust layer fallback or granularity

Accuracy alone is not enough either. Weight-only PTQ shrinks the memory footprint, but if the kernel dequantizes and then runs an FP16 matmul, latency may not drop as much as expected.
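The memory side of that statement is simple arithmetic and can be estimated up front. The helper below is a back-of-the-envelope sketch: the 7-billion-parameter count is a hypothetical example, and the estimate ignores scale and zero-point metadata, activations, and KV cache.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 2**30


num_params = 7_000_000_000   # hypothetical model size
for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    print(name, round(weight_memory_gib(num_params, bits), 2), "GiB")
```

For this example the weights shrink from about 13.0 GiB at FP16 to about 3.3 GiB at 4 bits. No equivalent formula exists for latency, which depends on the kernel and has to be measured on the target runtime.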

PTQ results should therefore be checked on two axes.

quality: perplexity, benchmark, task-specific eval
serving: memory, TTFT, TPOT, throughput, concurrency
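On the quality axis, perplexity is the exponential of the negative mean log-probability per token, so it can be compared before and after quantization from per-token log-probs alone. The numbers below are hypothetical placeholders, not results from any model; in practice they would come from running both checkpoints on the same held-out text.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probs for the same text from two checkpoints.
baseline_logprobs = [-1.2, -0.4, -2.1, -0.9]
quantized_logprobs = [-1.3, -0.5, -2.2, -0.9]

ppl_base = perplexity(baseline_logprobs)
ppl_quant = perplexity(quantized_logprobs)
print(ppl_base, ppl_quant, (ppl_quant - ppl_base) / ppl_base)
```

How large a relative increase is acceptable is a per-application decision, and a small perplexity change does not rule out a regression on a specific task, which is why the task-specific evals are listed separately.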

Review questions

What happens when calibration data does not represent production traffic?
What is the trade-off between static and dynamic quantization?
What is the practical reason PTQ is tried before QAT?