Naive Softmax Kernel

Softmax는 row의 score들을 probability로 바꾼다.

y_i = exp(x_i - max(x)) / sum_j exp(x_j - max(x))

Transformer attention에서는 score row마다 softmax를 적용한다.

세 단계

stable softmax는 보통 세 단계로 본다.

1. row maximum 찾기
2. sum(exp(x - max)) 계산
3. normalize

이 세 단계는 row 전체 값을 필요로 한다. 그래서 softmax는 단순 elementwise kernel이 아니다.

Naive row-wise kernel

가장 단순한 버전은 thread 하나가 row 하나를 맡는다.

__global__ void softmax_naive_kernel(
    const float* x,
    float* y,
    int rows,
    int cols
) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < rows) {
        float max_val = x[row * cols];
        for (int col = 1; col < cols; col++) {
            max_val = fmaxf(max_val, x[row * cols + col]);
        }

        float sum = 0.0f;
        for (int col = 0; col < cols; col++) {
            sum += expf(x[row * cols + col] - max_val);
        }

        for (int col = 0; col < cols; col++) {
            y[row * cols + col] = expf(x[row * cols + col] - max_val) / sum;
        }
    }
}

왜 naive인가

이 kernel은 이해하기 쉽지만 row 하나를 thread 하나가 모두 처리한다. row가 길면 parallelism을 충분히 쓰지 못한다.

나중에는 여러 thread가 같은 row를 나누어 읽고, max와 sum을 reduction으로 계산한다.

확인

softmax가 elementwise가 아닌 이유는 무엇인가?
왜 x - max를 한 뒤 exp를 계산하는가?
thread 하나가 row 하나를 맡는 naive 방식의 한계는 무엇인가?

Naive Softmax Kernel

세 단계

Naive row-wise kernel

왜 naive인가

확인

연결된 카드