Attention Score Matrix

cuda, attention, transformer, matmul

The first step of attention is the dot product of the queries with the keys.

S = Q K^T
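As a minimal sketch of this formula, the snippet below computes S for tiny inputs in plain Python. The function name score_matrix and the example values are hypothetical, chosen only for illustration; Q and K are assumed to be lists of rows with shape [N, d]. A real implementation would run this as a batched matmul on the GPU.

```python
def score_matrix(Q, K):
    """Return S = Q K^T, where Q and K are lists of N rows of length d."""
    # S[i][j] is the dot product of query row i with key row j.
    return [[sum(q_x * k_x for q_x, k_x in zip(q, k)) for k in K] for q in Q]


# Hypothetical example: N = 3 tokens, d = 2 features per token.
Q = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
K = [[1.0, 1.0], [0.0, 1.0], [2.0, 0.0]]

S = score_matrix(Q, K)
for row in S:
    print(row)
```

The result has one row per query and one column per key, so three tokens already produce a 3 x 3 matrix.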

The score matrix is large

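If the sequence length is N, the score matrix has shape [N, N]. With 4096 tokens, that is 4096 x 4096 = 16,777,216 score elements, about 16.77 million. A separate matrix of this size is produced for every head and for every sequence in the batch.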
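The arithmetic can be checked with a short sketch. The helper name score_matrix_bytes, the 2-byte element size (as for fp16), and the batch size of 8 with 32 heads are all assumptions made for illustration, not figures from the text.

```python
def score_matrix_bytes(seq_len, batch=1, heads=1, bytes_per_elem=2):
    """Return (element count, byte count) for fully materialized score matrices."""
    # One [seq_len, seq_len] matrix per head and per sequence in the batch.
    elems = seq_len * seq_len * batch * heads
    return elems, elems * bytes_per_elem


# A single head and a single sequence with N = 4096.
elems_one, bytes_one = score_matrix_bytes(4096)
print(elems_one)            # 16777216
print(bytes_one / 2**20)    # 32.0 (MiB, assuming 2 bytes per element)

# Hypothetical configuration: batch of 8, 32 heads.
elems_all, bytes_all = score_matrix_bytes(4096, batch=8, heads=32)
print(bytes_all / 2**30)    # 8.0 (GiB)
```

Under these assumed settings, the score matrices alone take 8 GiB if they are all stored at once, which is why their size matters.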

Naive attention is usually pictured as the following sequence of steps.

compute QK^T -> store the score matrix -> softmax -> multiply by V

This approach is easy to understand, but it writes the full score matrix to HBM and then reads it back.
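The four naive steps can be sketched in plain Python to show where the full score matrix is materialized. This is an illustrative CPU sketch, not a GPU kernel: the function name naive_attention and the example values are hypothetical, the usual 1/sqrt(d) scaling is left out to match S = Q K^T as written, and the comments mark where a GPU implementation would typically write S to HBM and read it back.

```python
import math


def naive_attention(Q, K, V):
    """Naive attention on lists of rows; returns (S, P, O)."""
    # Step 1: compute S = Q K^T, an [N, N] matrix.
    S = [[sum(q_x * k_x for q_x, k_x in zip(q, k)) for k in K] for q in Q]

    # Step 2: S now exists in full. On a GPU this is the point where
    # the whole matrix would typically be written out to HBM.

    # Step 3: row-wise softmax. Every element of S is read back here.
    P = []
    for row in S:
        row_max = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        P.append([e / total for e in exps])

    # Step 4: multiply the probabilities by V to get the output O = P V.
    d_v = len(V[0])
    O = [[sum(p * v[j] for p, v in zip(p_row, V)) for j in range(d_v)]
         for p_row in P]
    return S, P, O


# Hypothetical example: all-zero queries give uniform attention weights,
# so each output row is the mean of the rows of V.
Q = [[0.0, 0.0], [0.0, 0.0]]
K = [[1.0, 2.0], [3.0, 4.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
S, P, O = naive_attention(Q, K, V)
print(P)
print(O)
```

The intermediate lists S and P both have N x N entries, while the inputs and the output have only N x d, so the intermediates dominate memory traffic as N grows.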