Assembling the PyTorch MLP and Decoder Block

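A decoder block stacks two mixers on the residual path: an attention mixer and an MLP mixer. Each sublayer normalizes its input first (pre-norm) and adds its output back to x.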

x = x + attention(norm1(x))
x = x + mlp(norm2(x))

The MLP transforms the hidden vector of each token independently; it does not mix information across positions.

self.mlp = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.GELU(),
    nn.Linear(d_ff, d_model),
)
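The token-wise behavior can be checked directly. The following is a minimal sketch, not part of the original card: the sizes (d_model = 8, d_ff = 32, B = 2, T = 5) are illustrative assumptions. It applies the same MLP once to the whole [B, T, D] tensor and once to each token separately, then compares the results.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_ff = 8, 32  # illustrative sizes (assumption, not from the card)
mlp = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.GELU(),
    nn.Linear(d_ff, d_model),
)

x = torch.randn(2, 5, d_model)  # [B, T, D]

# nn.Linear acts on the last dimension, so the whole tensor can be passed at once.
full = mlp(x)

# Apply the MLP to one token position at a time and restack along T.
per_token = torch.stack([mlp(x[:, t, :]) for t in range(x.shape[1])], dim=1)

# If the MLP is token-wise, both paths agree up to float rounding.
token_independent = torch.allclose(full, per_token, atol=1e-5)
```

Because nn.Linear and nn.GELU only operate on the last dimension, no information moves between positions here; mixing across tokens is left to attention.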

The deliverable of this card is a single block that preserves the input shape.

input:  [B, T, D]
output: [B, T, D]
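Putting the pieces together, one possible assembly looks like the sketch below. The card does not specify the norm or the attention module, so this sketch assumes nn.LayerNorm for norm1 and norm2 and nn.MultiheadAttention with a causal mask for attention; the class name DecoderBlock and the n_heads parameter are also hypothetical.

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """Pre-norm decoder block: attention mixer, then MLP mixer, both residual."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        # Assumption: LayerNorm; the card only names norm1 and norm2.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Assumption: PyTorch's built-in multi-head attention stands in for "attention".
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.shape[1]
        # Boolean causal mask: True marks positions that must not be attended to.
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                 # x = x + attention(norm1(x))
        x = x + self.mlp(self.norm2(x))  # x = x + mlp(norm2(x))
        return x


block = DecoderBlock(d_model=16, n_heads=4, d_ff=64)
x = torch.randn(2, 5, 16)  # [B, T, D]
y = block(x)               # same shape as x
```

Since the output shape equals the input shape, blocks of this form can be stacked without any reshaping in between.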

Check