Reduction Pattern

elementwise kernel은 보통 input 하나 또는 몇 개로 output 하나를 만든다.

y[i] = f(x[i])

reduction은 여러 input을 하나의 값으로 줄인다.

sum = x[0] + x[1] + ... + x[n-1]
max = max(x[0], x[1], ..., x[n-1])

왜 중요할까

Transformer kernel에는 reduction이 계속 나온다.

softmax:
  row max
  row sum(exp)

RMSNorm:
  row sum(x^2)

sampling:
  top-k, top-p, max

elementwise는 독립적이다.

thread 0 -> y[0]
thread 1 -> y[1]
thread 2 -> y[2]

reduction은 여러 thread의 결과를 합쳐야 한다.

thread partial sums
-> block sum
-> final sum

그래서 __syncthreads, shared memory, warp shuffle 같은 도구가 최적화 단계에서 필요해진다.

이 path에서는 reduction을 깊게 최적화하지 않는다. 대신 다음을 구분할 수 있으면 충분하다.

elementwise:
  각 output이 독립적

reduction:
  output 하나가 여러 input에 의존

이 구분이 있어야 Path 3에서 softmax/RMSNorm을 최적화할 수 있다.