Calibration Validation and Range Estimation

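The point of running calibration data is not simply to see more samples.

The real goal is to answer one question at each layer:

How much of this tensor's real-valued range should be mapped onto the integer grid?

If the range is too wide, outliers are safe but most values land on a coarse grid. If the range is too narrow, most values get a fine grid but the outliers are clipped.

So the judgment after calibration does not end at “we ran representative data.” You also have to check whether that data covered the activation space well enough, and whether the chosen range preserves the task metric.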

Diagnostic: outlier ratio = max(|x|) / p99(|x|)

Method        Range rule          Clipping            Trade-off                Best fit
AbsMax        max(|x|)            0% on calibration   outlier protects range   weights, clean activations
Percentile    p99.9 / p99.99      small accepted      bulk gets resolution     moderate outliers
MSE-optimal   search alpha        chosen by error     minimize x vs x_hat      heavy-tailed activations
KL / Entropy  match distribution  can be aggressive   shape over value error   well-behaved CNN ranges

Outlier ratio   Starting point
< 3x            AbsMax is usually safe
3-10x           Try percentile or MSE-optimal
> 10x           Prefer MSE-optimal, inspect layers

Calibration is not just collecting min and max. It chooses where the integer grid spends resolution, then task validation decides whether that choice is acceptable.

Representative data must represent activations, not inputs

Representative calibration data does not mean a small random sample of production inputs.

What matters is covering the range that the model's internal activations will encounter.

Good calibration set:
covers the activation patterns that actually occur in production

Bad calibration set:
looks large in input count, but repeats only easy cases

For an LLM, a calibration set that contains only short conversations is risky, because code, long contexts, math, multiple languages, and structured text can produce much larger activations.

Vision models are no different. If you calibrate only on bright, clean images, the activation range can shift on production cases such as low light, high contrast, or motion blur.

Check coverage per layer

You need to verify that the calibration range covers the production range.

For example, if the calibration max for some layer is 42.8 but a production shadow run observes 51.3:

coverage = calibration max / production max
         = 42.8 / 51.3
         = 83.4%

This layer is likely to clip in real deployment.

coverage >= 100%  -> OK
coverage < 100%   -> WARNING
coverage < 90%    -> CRITICAL

This number is a practical signal for the question “is the calibration set representative enough?” If only specific layers show low coverage, add prompts or image slices that strongly excite those layers to the calibration set.
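The coverage check above can be sketched in a few lines of plain Python. The function name `coverage_status`, the layer names, and every maximum other than the 42.8 / 51.3 pair are hypothetical and chosen only for illustration. The thresholds are the ones listed in the text.

```python
def coverage_status(calib_max, prod_max):
    """Return (coverage, status) for one layer.

    coverage = calibration max / production max, with the thresholds
    from the text: >= 100% OK, < 100% WARNING, < 90% CRITICAL.
    """
    if prod_max <= 0:
        raise ValueError("prod_max must be positive")
    coverage = calib_max / prod_max
    if coverage >= 1.0:
        status = "OK"
    elif coverage >= 0.9:
        status = "WARNING"
    else:
        status = "CRITICAL"
    return coverage, status


# Hypothetical per-layer maxima: (calibration max, production shadow-run max).
layer_maxima = {
    "layer_a": (42.8, 51.3),  # the example from the text
    "layer_b": (10.0, 10.4),
    "layer_c": (7.5, 6.9),
}

for name, (calib_max, prod_max) in layer_maxima.items():
    cov, status = coverage_status(calib_max, prod_max)
    print(f"{name}: coverage={cov:.1%} status={status}")
```

Running the check per layer, rather than on a single model-wide maximum, is what makes it possible to see which layers need extra calibration samples.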

Range estimation is a choice between clipping and resolution

The simplest method is absmax.

range = max(|x|)
scale = range / 127

AbsMax produces no clipping within the calibration set. But if a single outlier inflates the range, the scale grows with it, and most of the small values end up on a coarse integer grid.
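To make the effect concrete, here is a minimal sketch of symmetric INT8 absmax quantization in plain Python. The helper names (`absmax_scale`, `fake_quantize`, `mean_abs_error`) and the toy data are assumptions for illustration. A real pipeline would operate on tensors rather than lists.

```python
def absmax_scale(values, qmax=127):
    """Symmetric scale from the largest magnitude: range / qmax."""
    return max(abs(v) for v in values) / qmax


def fake_quantize(values, scale, qmax=127):
    """Quantize to the integer grid, then dequantize back to real values."""
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # round, then clamp
        out.append(q * scale)
    return out


def mean_abs_error(values, scale):
    """Average |x - x_hat| after a quantize/dequantize round trip."""
    recon = fake_quantize(values, scale)
    return sum(abs(v - r) for v, r in zip(values, recon)) / len(values)


# Hypothetical tensor: 101 small values in [-0.5, 0.5].
bulk = [0.01 * i for i in range(-50, 51)]

scale_clean = absmax_scale(bulk)             # range set by the bulk itself
scale_outlier = absmax_scale(bulk + [40.0])  # range set by a single outlier

print("scale without outlier:", scale_clean)
print("scale with outlier:   ", scale_outlier)
print("bulk error without outlier:", mean_abs_error(bulk, scale_clean))
print("bulk error with outlier:   ", mean_abs_error(bulk, scale_outlier))
print("distinct bulk levels with outlier:",
      len(set(fake_quantize(bulk, scale_outlier))))
```

With the outlier present, the same bulk values share only a handful of grid levels, which is the resolution cost described above.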

The percentile method gives up on some outliers.

range = p99.9(|x|) or p99.99(|x|)

This method accepts a small amount of clipping in exchange for a finer grid for most values.
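A small sketch of percentile-based range selection follows. It uses a nearest-rank percentile, which is one of several common percentile definitions. The helper names and the synthetic activations (a Gaussian bulk plus five injected outliers) are assumptions made for this example.

```python
import math
import random


def percentile_abs(values, pct):
    """Nearest-rank percentile of |x|, with pct in (0, 100]."""
    mags = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(pct / 100.0 * len(mags)))
    return mags[min(rank, len(mags)) - 1]


def clipped_fraction(values, clip_range):
    """Fraction of values whose magnitude exceeds the chosen range."""
    return sum(1 for v in values if abs(v) > clip_range) / len(values)


# Hypothetical activations: a Gaussian bulk plus five large outliers.
rng = random.Random(0)
acts = [rng.gauss(0.0, 1.0) for _ in range(9995)]
acts += [30.0, -35.0, 40.0, 45.0, -50.0]

absmax_range = max(abs(v) for v in acts)
p999_range = percentile_abs(acts, 99.9)

for name, r in [("absmax", absmax_range), ("p99.9", p999_range)]:
    print(f"{name}: range={r:.3f} scale={r / 127:.5f} "
          f"clipped={clipped_fraction(acts, r):.4%}")
```

The printout shows the trade directly: the percentile range clips a small fraction of values and gets a much smaller scale in return.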

MSE-optimal tries several candidate ranges.

for candidate range:
  quantize -> dequantize
  measure MSE(x, x_hat)

choose range with lowest MSE

For heavy-tailed distributions such as Transformer activations, MSE-optimal or a conservative percentile can work better than absmax.
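The candidate search can be sketched as a simple grid search over fractions of the absmax range. This is one possible way to run the search, not a reference implementation. The function names, the number of candidates, and the synthetic data are assumptions.

```python
import random


def quant_mse(values, clip_range, qmax=127):
    """MSE between x and x_hat after quantize -> dequantize with this range."""
    scale = clip_range / qmax
    total = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # values beyond range clip
        total += (v - q * scale) ** 2
    return total / len(values)


def mse_optimal_range(values, num_candidates=50, qmax=127):
    """Try ranges abs_max * i / num_candidates and keep the lowest-MSE one."""
    abs_max = max(abs(v) for v in values)
    best_range = abs_max
    best_mse = quant_mse(values, abs_max, qmax)
    for i in range(1, num_candidates):
        candidate = abs_max * i / num_candidates
        mse = quant_mse(values, candidate, qmax)
        if mse < best_mse:
            best_range, best_mse = candidate, mse
    return best_range, best_mse


# Hypothetical activations: a Gaussian bulk plus one large outlier.
rng = random.Random(0)
acts = [rng.gauss(0.0, 1.0) for _ in range(2000)] + [60.0]

abs_max = max(abs(v) for v in acts)
best_range, best_mse = mse_optimal_range(acts)
print(f"absmax range={abs_max:.2f} mse={quant_mse(acts, abs_max):.6f}")
print(f"chosen range={best_range:.2f} mse={best_mse:.6f}")
```

Because absmax is itself one of the candidates, the chosen range never has a higher MSE than absmax on the calibration data. Note that clipping one very large value costs a large squared error, so on a small sample the search may stay close to absmax. How far it moves depends on the bit-width and on how many values sit in the tail.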

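The KL/entropy method tries to match the shape of the quantized distribution to the original one. It is usable when the distribution is relatively well-behaved, as in CNNs, but with long-tailed distributions such as LLM activations it can cut the tail too aggressively and fail.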

The outlier ratio is the first diagnostic

Before choosing a range method, look at the outlier ratio first.

outlier ratio = max(|x|) / p99(|x|)

If this value is small, absmax does not waste much of the grid. If it is large, a single max value is dominating the scale.

A rough rule of thumb:

outlier ratio < 3x
-> absmax is a reasonable starting point

outlier ratio 3x ~ 10x
-> compare percentile and MSE-optimal

outlier ratio > 10x
-> consider MSE-optimal, finer granularity, layer fallback

The important point here is that the outlier ratio can differ from layer to layer. You need to find the problematic layers, not a whole-model average.
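A per-layer version of this diagnostic might look like the sketch below. The thresholds come from the rule of thumb above. The function names, the layer names, the nearest-rank p99, and the toy activation lists are assumptions for illustration.

```python
import math


def outlier_ratio(values):
    """max(|x|) / p99(|x|), using a nearest-rank p99."""
    mags = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(0.99 * len(mags)))
    p99 = mags[min(rank, len(mags)) - 1]
    if p99 == 0.0:
        # Degenerate case: almost everything is zero.
        return float("inf") if mags[-1] > 0.0 else 1.0
    return mags[-1] / p99


def suggest_range_method(ratio):
    """Map the outlier ratio to the starting points listed in the text."""
    if ratio < 3.0:
        return "absmax"
    if ratio <= 10.0:
        return "percentile or MSE-optimal"
    return "MSE-optimal, finer granularity, or layer fallback"


# Hypothetical per-layer activation samples.
layers = {
    "layer_a": [1.0] * 999 + [2.0],
    "layer_b": [1.0] * 999 + [5.0],
    "layer_c": [1.0] * 999 + [20.0],
}

for name, acts in layers.items():
    ratio = outlier_ratio(acts)
    print(f"{name}: outlier ratio={ratio:.1f}x -> {suggest_range_method(ratio)}")
```

Reporting the ratio per layer keeps one badly behaved layer from being hidden by an average over the whole model.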

The final evaluation uses the task metric

A low calibration error does not guarantee that task performance is preserved.

So the final evaluation is always a comparison against the FP baseline.

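1. Measure the task metric of the FP16/FP32 baseline
2. Compute scale / zero-point / range from the calibration data
3. Build the quantized model
4. Re-evaluate on an eval set that was not used for calibration
5. Decide whether the delta against the baseline is within the acceptable budget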
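Step 5 can be reduced to a small acceptance check. The sketch below assumes the two metrics have already been measured on the held-out eval set. The function name `within_budget`, the budgets, and all metric values are made up for illustration.

```python
def within_budget(baseline_metric, quantized_metric, max_drop,
                  higher_is_better=True):
    """Return (delta, ok) where delta = quantized - baseline.

    For metrics where higher is better (accuracy, F1), a drop is a negative
    delta. For metrics where lower is better (perplexity), a drop is a
    positive delta.
    """
    delta = quantized_metric - baseline_metric
    drop = -delta if higher_is_better else delta
    return delta, drop <= max_drop


# Hypothetical numbers, not measured results.
acc_delta, acc_ok = within_budget(0.912, 0.905, max_drop=0.01)
ppl_delta, ppl_ok = within_budget(5.80, 6.10, max_drop=0.2,
                                  higher_is_better=False)

print(f"accuracy delta={acc_delta:+.3f} accepted={acc_ok}")
print(f"perplexity delta={ppl_delta:+.2f} accepted={ppl_ok}")
```

The acceptable budget is a product decision, so it is passed in as a parameter rather than fixed in the code.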

For LLMs you can look at perplexity, task benchmarks, internal evals, and human/judge evals together. For a classification model, look at task metrics such as accuracy, F1, and calibration error.

Calibration data and evaluation data should be kept separate here.

calibration data:
data used to determine the quantization parameters

evaluation data:
data used to judge model quality after quantization

What to change when it fails

If the PTQ result is worse than the acceptable budget, break down the cause before jumping straight to QAT.

coverage failure:
the calibration set does not cover production activations
-> extend the calibration data

outlier-dominated range:
the scale is dragged by outliers
-> percentile, MSE-optimal, finer granularity

specific layer failure:
only specific layers are sensitive
-> mixed precision or layer fallback

bit-width too aggressive:
the task metric drops sharply at INT4 or below
-> consider GPTQ/AWQ, QAT, or a higher bit-width
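The four failure categories can be turned into a simple triage helper. This is only a sketch: the function name, its inputs, and the exact trigger conditions (coverage below 100%, outlier ratio above a threshold defaulting to 10, a non-empty list of sensitive layers, 4 bits or fewer) are assumptions layered on top of the categories listed above.

```python
def triage_ptq_failure(min_coverage, max_outlier_ratio, sensitive_layers,
                       bits, outlier_threshold=10.0):
    """Return suggested actions for a PTQ run that missed its metric budget.

    min_coverage:       lowest per-layer coverage (1.0 means 100%)
    max_outlier_ratio:  highest per-layer max(|x|) / p99(|x|)
    sensitive_layers:   names of layers found to be unusually sensitive
    bits:               quantization bit-width
    """
    actions = []
    if min_coverage < 1.0:
        actions.append("coverage failure: extend calibration data")
    if max_outlier_ratio > outlier_threshold:
        actions.append("outlier-dominated range: try percentile, "
                       "MSE-optimal, or finer granularity")
    if sensitive_layers:
        actions.append("specific layer failure: mixed precision or fallback "
                       "for " + ", ".join(sensitive_layers))
    if bits <= 4:
        actions.append("bit-width too aggressive: consider GPTQ/AWQ, QAT, "
                       "or a higher bit-width")
    return actions


# Hypothetical diagnostics for one failed run.
for action in triage_ptq_failure(min_coverage=0.834, max_outlier_ratio=14.2,
                                 sensitive_layers=["layer_a"], bits=8):
    print("-", action)
```

Several causes can be present at once, which is why the helper returns a list instead of a single verdict.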

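The key point of Chapter 4 is to stop treating PTQ as “a one-shot conversion command.” PTQ is a workflow that includes calibration, range selection, task validation, and failure analysis.

Check

What should a calibration set represent, rather than the input distribution?
Why is AbsMax vulnerable to outliers?
When the outlier ratio is large, which range estimation method should you suspect first?
Why should calibration data and evaluation data be kept separate?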