NF4 and QLoRA for Fine-Tuning Memory

NF4 and QLoRA are not aimed at serving throughput first.

The core pressure is fine-tuning memory.

7B model in FP16:
weights alone ~= 14 GB

fine-tuning:
weights + gradients + optimizer states + activations
-> does not fit on a 16 GB consumer GPU
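
A rough back-of-the-envelope version of this budget, as a sketch. The bytes-per-parameter figures are assumptions (FP16 weights and gradients, an Adam-style optimizer keeping two FP32 states per parameter), and activations are left out because they depend on batch size and sequence length.

```python
# Rough memory estimate for full fine-tuning of a 7B model.
# Assumptions (not tied to any specific framework): FP16 weights and
# gradients, Adam keeping two FP32 states per parameter, activations ignored.
PARAMS = 7e9
GB = 1e9  # decimal gigabytes, matching the "~= 14 GB" figure above


def full_finetune_gb(params: float) -> dict:
    weights = params * 2 / GB      # FP16: 2 bytes per parameter
    gradients = params * 2 / GB    # one FP16 gradient per parameter
    optimizer = params * 8 / GB    # two FP32 states: 8 bytes per parameter
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total_without_activations": weights + gradients + optimizer,
    }


budget = full_finetune_gb(PARAMS)
for name, gb in budget.items():
    print(f"{name}: {gb:.0f} GB")
```

Under these assumptions the total is about 84 GB before any activations are counted, which is why the weights alone are not the number to watch.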

GPTQ and AWQ produce weight-only inference artifacts. Fine-tuning additionally needs gradients and optimizer state, and it changes how the quantized base weights have to be handled.

NF4 Is Not an Exponent/Mantissa Format

NF4 is not a floating-point format that splits its bits into exponent and mantissa the way FP4 does.

NF4 is a codebook of 16 values fitted to a normal distribution.

INT4:
uniform grid

FP4:
non-uniform numeric format built from exponent/mantissa

NF4:
lookup table fitted to normal distribution quantiles

Neural network weights tend to cluster near 0. NF4 assumes this distribution and places codes more densely in the ranges that occur most often.
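
To make the three grids concrete, here is a minimal sketch that lists them side by side. The FP4 values assume the E2M1 layout, which is only one possible FP4 variant. The function `normal_quantile_codebook` and its `tail` parameter are hypothetical names for a simplified construction; the real NF4 table differs in detail (for example, it contains an exact zero), so read this as an illustration of the idea, not as the actual table.

```python
from statistics import NormalDist

# INT4: uniform grid of 16 integers.
int4_values = list(range(-8, 8))

# FP4: assuming the E2M1 layout (1 sign, 2 exponent, 1 mantissa bit).
# Other FP4 variants exist and produce different values.
fp4_magnitudes = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def normal_quantile_codebook(levels=16, tail=0.02):
    """Simplified stand-in for NF4: evenly spaced normal quantiles in [-1, 1].

    `tail` (hypothetical) keeps the outermost probabilities away from 0 and 1,
    where the inverse CDF diverges.
    """
    nd = NormalDist()
    probs = [tail + (1 - 2 * tail) * i / (levels - 1) for i in range(levels)]
    values = [nd.inv_cdf(p) for p in probs]
    peak = max(abs(v) for v in values)
    return [v / peak for v in values]


codebook = normal_quantile_codebook()
gaps = [b - a for a, b in zip(codebook, codebook[1:])]

print("INT4:", int4_values)
print("FP4 magnitudes:", fp4_magnitudes)
print("quantile codebook:", [round(v, 3) for v in codebook])
print("gap near zero:", round(gaps[7], 3), "| gap at the tail:", round(gaps[0], 3))
```

The last line shows the property described above: neighbouring codes sit much closer together near zero than at the tails.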

So the question NF4 asks is this.

How should a 4-bit number format be designed?

More precisely:
Which 16 representative values capture the weight distribution with the least reconstruction error?
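
One way to write that question down, as a sketch. It assumes the weights in a block are normalized to [-1, 1] and drawn from a distribution p, and it measures error as squared distance to the nearest code. The symbols c_i and p are introduced here for illustration only. NF4 itself picks its codes from normal quantiles, which is a closely related criterion but not identical to minimizing this expression.

```latex
\min_{c_1 \le \dots \le c_{16}} \;
\mathbb{E}_{w \sim p}\!\left[ \min_{1 \le i \le 16} \,(w - c_i)^2 \right],
\qquad c_i \in [-1, 1]
```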

QLoRA does not full fine-tune the entire base model.

base model:
stored in NF4
frozen

LoRA adapter:
small low-rank matrices
trainable

compute:
dequantize the base when needed
update only the adapter

With this structure, the 4-bit base weights deliver a large memory saving, and the number of trainable parameters is limited to the LoRA adapter.
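
The structure above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the QLoRA implementation: `CODEBOOK` is a placeholder 16-value table (denser near zero, but not the real NF4 values), the block size of 64 and the `alpha / rank` scaling are assumed settings, and the gradients are written by hand for a toy squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder 16-value codebook in [-1, 1], denser near zero. Not the NF4 table.
t = np.linspace(-1.0, 1.0, 16)
CODEBOOK = t * np.abs(t)
BLOCK = 64  # assumed block size


def quantize(w):
    blocks = w.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True)   # one absmax scale per block
    normalized = blocks / scales
    codes = np.abs(normalized[..., None] - CODEBOOK).argmin(axis=-1)
    return codes.astype(np.uint8), scales


def dequantize(codes, scales, shape):
    return (CODEBOOK[codes] * scales).reshape(shape)


d_in, d_out, rank, alpha = 128, 64, 8, 16
W = rng.normal(0, 0.02, size=(d_out, d_in))   # pretrained weight
codes, scales = quantize(W)                   # frozen 4-bit codes plus scales

A = rng.normal(0, 0.01, size=(rank, d_in))    # trainable LoRA factor
B = np.zeros((d_out, rank))                   # trainable, zero-initialized
s = alpha / rank


def forward(x):
    W_hat = dequantize(codes, scales, W.shape)   # dequantize only for compute
    return x @ W_hat.T + s * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))
target = rng.normal(size=(4, d_out))
y = forward(x)

# One manual gradient step on L = 0.5 * ||y - target||^2, adapter only.
err = y - target
grad_B = s * err.T @ (x @ A.T)
grad_A = s * (err @ B).T @ x
lr = 1e-3
B -= lr * grad_B
A -= lr * grad_A
# `codes` and `scales` are never touched by the update.
```

Because `B` starts at zero, the first forward pass equals the dequantized base output, and only `A` and `B` ever receive an update.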

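The important point is that QLoRA does not mean “all training memory becomes 4-bit.” Activations, gradients, optimizer state, and adapter parameters still consume memory, although gradients and optimizer state are now needed only for the adapter.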

Using 4-bit weights also means more scale metadata, because every small block of weights carries its own scale.

block of weights
-> NF4 code values
-> scale
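
The block layout above can be sketched as follows. The assumptions: a block of 64 weights, an absmax scale stored as FP32, two 4-bit codes packed into each byte, and a placeholder `CODEBOOK` that is denser near zero but is not the real NF4 table. The helper names are hypothetical.

```python
import numpy as np

# Placeholder sorted 16-value codebook in [-1, 1]; not the real NF4 table.
t = np.linspace(-1.0, 1.0, 16)
CODEBOOK = t * np.abs(t)


def quantize_block(block):
    scale = np.abs(block).max()                       # absmax scale for this block
    normalized = block[:, None] / scale               # values now lie in [-1, 1]
    codes = np.abs(normalized - CODEBOOK).argmin(axis=1).astype(np.uint8)
    packed = (codes[0::2] << 4) | codes[1::2]         # two 4-bit codes per byte
    return packed, np.float32(scale)


def dequantize_block(packed, scale):
    codes = np.empty(packed.size * 2, dtype=np.uint8)
    codes[0::2] = packed >> 4                         # high nibble
    codes[1::2] = packed & 0x0F                       # low nibble
    return CODEBOOK[codes] * scale


rng = np.random.default_rng(0)
block = rng.normal(0, 0.02, size=64).astype(np.float32)
packed, scale = quantize_block(block)
restored = dequantize_block(packed, scale)

print(packed.nbytes, "bytes of codes +", scale.nbytes, "bytes of scale")
print("max abs error:", float(np.max(np.abs(restored - block))))
```

With these settings a block of 64 weights takes 32 bytes of codes plus 4 bytes of scale, so the effective cost is 4.5 bits per weight, not 4.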

In the QLoRA family, the scales themselves can be quantized again to reduce the metadata overhead. This is called double quantization.
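
A small arithmetic sketch of what double quantization buys. The configuration is an assumption, recalled from the QLoRA paper and not stated in this text: 64 weights per block, one FP32 scale per block, and scales re-quantized to 8 bits in groups of 256 with one FP32 constant per group.

```python
# Metadata overhead in bits per weight, with and without double quantization.
# All four constants are assumed settings, not values taken from this document.
BLOCK = 64           # weights per block
SCALE_BITS = 32      # FP32 scale per block
DQ_SCALE_BITS = 8    # scale after the second quantization
DQ_GROUP = 256       # scales per second-level group

plain = SCALE_BITS / BLOCK
double = DQ_SCALE_BITS / BLOCK + SCALE_BITS / (BLOCK * DQ_GROUP)


def metadata_gb(params, bits_per_weight):
    return params * bits_per_weight / 8 / 1e9


print(f"plain:  {plain:.3f} bits/weight, {metadata_gb(7e9, plain):.2f} GB for 7B")
print(f"double: {double:.3f} bits/weight, {metadata_gb(7e9, double):.2f} GB for 7B")
```

Under these assumptions the scale overhead drops from 0.5 to about 0.127 bits per weight.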

The key is to look beyond the weight data: the scale metadata needed to interpret the quantization also has to be counted in the memory budget.