Decoder-only Transformer

GPT-family LLMs are typically decoder-only Transformers. The overall flow is as follows.

token ids
  -> embedding
  -> repeated decoder blocks
  -> logits
  -> next token
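
The flow above can be sketched end to end in plain Python. This is only a toy under stated assumptions: `embed`, `toy_block`, `to_logits`, and `next_token` are hypothetical names, and `toy_block` is a stand-in (a causal running mean plus a residual add) rather than a real decoder block with attention and an MLP/FFN.

```python
def embed(token_ids, table):
    """token ids [T] -> hidden states [T, D] via embedding table lookup."""
    return [list(table[t]) for t in token_ids]

def toy_block(hidden):
    """Stand-in for a decoder block (assumption, not real attention):
    each position adds the mean of itself and all earlier positions."""
    out = []
    for i, h in enumerate(hidden):
        seen = hidden[: i + 1]  # causal: only positions <= i
        mean = [sum(col) / len(seen) for col in zip(*seen)]
        out.append([a + b for a, b in zip(h, mean)])
    return out

def to_logits(hidden, w_out):
    """hidden states [T, D] -> logits [T, V] with w_out of shape [D, V]."""
    vocab = len(w_out[0])
    return [[sum(h[d] * w_out[d][v] for d in range(len(h))) for v in range(vocab)]
            for h in hidden]

def next_token(token_ids, table, w_out, n_blocks=2):
    """Run the whole flow and greedily pick the next token id."""
    hidden = embed(token_ids, table)
    for _ in range(n_blocks):          # repeated decoder blocks
        hidden = toy_block(hidden)
    last = to_logits(hidden, w_out)[-1]  # logits at the last position
    return max(range(len(last)), key=last.__getitem__)

# Toy sizes: vocabulary V=3, hidden size D=2
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # [V, D]
w_out = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]     # [D, V]
print(next_token([0, 1], table, w_out))
```

In a real model each stage is learned, but the order of the stages matches the flow above.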

Each decoder block contains masked self-attention and an MLP/FFN.

decoder block
  -> masked self-attention
  -> MLP/FFN

Here, masked self-attention means self-attention with a causal mask applied. Each token position can attend only to itself and to earlier tokens.
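
As a minimal sketch of the causal mask, the example below builds a T x T visibility matrix and applies it inside a row-wise softmax. The function names `causal_mask` and `masked_softmax` are illustrative assumptions, and the uniform scores are toy data chosen so the effect of the mask is easy to read.

```python
import math

def causal_mask(T):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask):
    """Row-wise softmax where masked-out positions get weight exactly 0."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

T = 4
mask = causal_mask(T)
scores = [[0.0] * T for _ in range(T)]  # uniform toy attention scores
weights = masked_softmax(scores, mask)
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, row i spreads its weight evenly over positions 0..i and assigns zero to every later position, so the first row is `[1.0, 0.0, 0.0, 0.0]` and the last is `[0.25, 0.25, 0.25, 0.25]`.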

Logits

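The output of the last block is still a tensor of hidden states with shape [B, T, D] (batch size B, sequence length T, hidden size D). To predict the next token, it has to be converted into scores over the vocabulary, whose size is V.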

hidden states: [B, T, D]
output projection: [D, V]
logits: [B, T, V]

The logits at each position are scores over the entire vocabulary. At inference time, the next token is usually sampled or selected from the logits at the last position.
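
The shape bookkeeping above can be checked with a small sketch that uses nested lists as tensors. The names `project_to_logits` and `greedy_next_token` are hypothetical, the weights are made-up toy values, and greedy argmax is only one of the selection strategies mentioned (sampling is the other).

```python
def project_to_logits(hidden, w_out):
    """Map hidden states [B, T, D] to logits [B, T, V] using w_out of shape [D, V]."""
    vocab = len(w_out[0])
    return [
        [
            [sum(h[d] * w_out[d][v] for d in range(len(h))) for v in range(vocab)]
            for h in sequence
        ]
        for sequence in hidden
    ]

def greedy_next_token(logits):
    """For each batch item, pick the highest-scoring token id at the last position."""
    picks = []
    for sequence in logits:
        last = sequence[-1]  # logits at the final position, shape [V]
        picks.append(max(range(len(last)), key=last.__getitem__))
    return picks

# Toy sizes: B=1, T=2, D=2, V=3
hidden = [[[1.0, 0.0], [0.0, 1.0]]]
w_out = [[0.1, 0.9, 0.0],
         [0.2, 0.3, 0.7]]
logits = project_to_logits(hidden, w_out)
next_ids = greedy_next_token(logits)
print(len(logits), len(logits[0]), len(logits[0][0]))  # B, T, V
print(next_ids)
```

Only the last position's logits are used to choose the next token here; the logits at earlier positions matter during training, where every position predicts its own next token.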

In an encoder-style Transformer, the input tokens can usually attend to each other bidirectionally. BERT is a representative example.

A decoder-only Transformer cannot see future tokens because of the causal mask. That makes it a good fit for language models that generate the next token from left to right.
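
The difference between the two styles can be illustrated with a toy experiment: change only the last token and check whether the outputs at earlier positions move. The `mix` function is a hypothetical stand-in for attention with uniform weights, not an actual Transformer layer.

```python
def mix(values, causal):
    """Toy attention with uniform weights: each position averages what it can see."""
    out = []
    for i in range(len(values)):
        visible = values[: i + 1] if causal else values
        out.append(sum(visible) / len(visible))
    return out

original = [1.0, 2.0, 3.0, 4.0]
edited = [1.0, 2.0, 3.0, 100.0]  # only the last position differs

# Compare the outputs at the first three positions before and after the edit.
causal_same = mix(original, causal=True)[:3] == mix(edited, causal=True)[:3]
bidir_same = mix(original, causal=False)[:3] == mix(edited, causal=False)[:3]
print(causal_same)  # True: earlier positions never saw the last token
print(bidir_same)   # False: every position saw the changed token
```

Because earlier outputs do not depend on later tokens under the causal mask, a decoder-only model can append one token at a time without invalidating what it computed for the prefix.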