Attention Weighted Sum

Once Q and K have produced the scores, the next step is to fetch the actual information. That is the role of V.

attention weights: [T, T]
V: [T, H]
output: [T, H]

The computation is a single matrix product.

output = attention weights x V

The output at each token position is a weighted sum of the V vectors.

output[i]
  = weight[i, 1] * V[1]
  + weight[i, 2] * V[2]
  + weight[i, 3] * V[3]
  + ...

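In other words, the i-th token builds a new representation by mixing information from the tokens in the sequence, itself included, in proportion to its own attention weights.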
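
The weighted sum can be checked by hand on a tiny case. The sketch below is only an illustration: the sizes (T = 3, H = 2), the numbers, and the helper name weighted_sum are assumptions made for this example, and indices are 0-based as usual in Python rather than the 1-based form used in the formula.

```python
# Toy values, assumed for illustration: T = 3 tokens, H = 2 hidden dims.
# Each row of `weights` sums to 1, as softmax output would.
weights = [
    [0.5, 0.25, 0.25],
    [0.0, 1.0, 0.0],
    [0.25, 0.25, 0.5],
]
V = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
]


def weighted_sum(weights, V):
    """Return output[i] = sum over j of weights[i][j] * V[j]."""
    H = len(V[0])
    output = []
    for w_row in weights:
        mixed = [0.0] * H
        for w, v in zip(w_row, V):
            for h in range(H):
                mixed[h] += w * v[h]  # add token j's contribution
        output.append(mixed)
    return output


output = weighted_sum(weights, V)  # shape [T, H]
for row in output:
    print(row)
```

Position 1 puts all of its weight on token 1, so its output is simply a copy of V[1]; the other positions blend several V vectors.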

Attention in One Line

The core flow of self-attention can be summarized as follows.

x
  -> Q, K, V
  -> scores = QK^T
  -> weights = softmax(scores)
  -> output = weights V

Here, Q and K decide “where to look,” and V supplies “what to fetch.”
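
The whole flow fits in a few lines of plain Python. This is a minimal single-head sketch rather than a reference implementation: the projection matrices W_q, W_k, W_v and the helper names are assumptions introduced for the example, and the scores are left unscaled to mirror the summary (scaled dot-product attention would also divide them by the square root of the key dimension).

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax_rows(S):
    """Apply softmax to each row so that every row sums to 1."""
    out = []
    for row in S:
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, W_q, W_k, W_v):
    """x: [T, D] -> (weights: [T, T], output: [T, H])."""
    Q = matmul(x, W_q)                # x -> Q, K, V
    K = matmul(x, W_k)
    V = matmul(x, W_v)
    scores = matmul(Q, transpose(K))  # scores = QK^T
    weights = softmax_rows(scores)    # weights = softmax(scores)
    output = matmul(weights, V)       # output = weights V
    return weights, output


# Toy input: 3 tokens, 2 features. Identity projections (an assumption
# for the example) make Q = K = V = x, which keeps the numbers readable.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, output = self_attention(x, identity, identity, identity)
```

With these toy values, token 0 scores itself and token 2 equally, so it attends to both with the same weight and less to token 1.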