Warp-level Reduction

cuda warp reduction shuffle

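The threads of a warp execute together, and the `_sync` shuffle intrinsics synchronize the participating lanes before exchanging data (on Volta and later GPUs, strict lockstep is not guaranteed without them). A reduction inside a warp can therefore be done with shuffle instructions, without shared memory.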

Typical pattern

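// value: this lane's partial result; all 32 lanes must execute this loop
for (int offset = 16; offset > 0; offset >>= 1) {
    value += __shfl_down_sync(0xffffffff, value, offset);
}
// lane 0 now holds the sum over the whole warp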

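With offset=16, lane 0 receives the value of lane 16 (in general, lane i receives the value of lane i+16). The offset then halves to 8, 4, 2, 1, adding at each step, so after the last step lane 0 holds the sum of all 32 lanes.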
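The loop above can be wrapped in a small device function and exercised from the host. The following is a minimal sketch, not a definitive implementation: the names `warp_reduce_sum`, `warp_sum_kernel`, and `warp_sum_host` are hypothetical, the launch assumes exactly one full warp of 32 threads, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Sum `value` across the 32 lanes of a warp. The complete sum is only
// guaranteed in lane 0. Assumes every lane of the warp calls this function,
// which is what the full mask 0xffffffff states.
__device__ float warp_reduce_sum(float value) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        value += __shfl_down_sync(0xffffffffu, value, offset);
    }
    return value;
}

// Hypothetical test kernel: a single warp sums 32 inputs.
__global__ void warp_sum_kernel(const float* in, float* out) {
    float v = warp_reduce_sum(in[threadIdx.x]);
    if (threadIdx.x == 0) {
        *out = v;  // only lane 0 holds the full sum
    }
}

// Hypothetical host helper: copies 32 floats to the device, launches one
// warp, and returns the sum. Error checking is omitted in this sketch.
float warp_sum_host(const float* host_in) {
    float* d_in = nullptr;
    float* d_out = nullptr;
    float result = 0.0f;
    cudaMalloc(&d_in, 32 * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
    warp_sum_kernel<<<1, 32>>>(d_in, d_out);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Lanes other than lane 0 end up with partial sums only, so the result should be read from lane 0 or broadcast from it if every lane needs it.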

When to use it

It shows up often in kernels that need a sum or max within a row, such as softmax, RMSNorm, and layer norm. Block reductions also commonly use a warp reduction for their final stage.
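As an illustration of the row-wise case, here is a sketch of a per-row max, the first step of a numerically stable softmax. It assumes one warp per row (a block of 32 threads per row) and a row-major input. The names `warp_reduce_max`, `row_max_kernel`, and `row_max_host` are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Max across the 32 lanes of a warp; the result is valid in lane 0.
__device__ float warp_reduce_max(float value) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        value = fmaxf(value, __shfl_down_sync(0xffffffffu, value, offset));
    }
    return value;
}

// Hypothetical kernel: block `row` (one warp) computes the max of one row
// of a row-major matrix with `cols` columns.
__global__ void row_max_kernel(const float* x, float* row_max, int cols) {
    int row = blockIdx.x;
    int lane = threadIdx.x;
    float m = -INFINITY;
    // Each lane strides over the row, so `cols` need not be a multiple of 32.
    for (int c = lane; c < cols; c += 32) {
        m = fmaxf(m, x[row * cols + c]);
    }
    // Every lane reaches this call, including lanes that read no elements.
    m = warp_reduce_max(m);
    if (lane == 0) {
        row_max[row] = m;
    }
}

// Hypothetical host helper; error checking omitted in this sketch.
void row_max_host(const float* h_x, float* h_out, int rows, int cols) {
    float* d_x = nullptr;
    float* d_out = nullptr;
    size_t x_bytes = (size_t)rows * cols * sizeof(float);
    cudaMalloc(&d_x, x_bytes);
    cudaMalloc(&d_out, rows * sizeof(float));
    cudaMemcpy(d_x, h_x, x_bytes, cudaMemcpyHostToDevice);
    row_max_kernel<<<rows, 32>>>(d_x, d_out, cols);
    cudaMemcpy(h_out, d_out, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_out);
}
```

In a real softmax kernel every lane needs the row max, not just lane 0. One option is to broadcast it with `__shfl_sync(0xffffffffu, m, 0)` after the reduction.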

Key points

__shfl_down_sync is not a memory load/store; it exchanges register values directly between lanes.
The warp size is 32, so offset usually starts at 16.
Combining values with threads outside the warp requires an additional block-level reduction.
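The last point can be sketched as a two-stage block reduction: each warp reduces with shuffles, lane 0 of each warp writes its partial sum to shared memory, and the first warp reduces those partials. This is one common arrangement rather than the only one. It assumes `blockDim.x` is a multiple of 32 and at most 1024, and the names `block_reduce_sum`, `block_sum_kernel`, and `block_sum_host` are hypothetical. Error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Warp-level sum; starts at warpSize / 2, which is 16 for a 32-lane warp.
__device__ float warp_reduce_sum(float value) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        value += __shfl_down_sync(0xffffffffu, value, offset);
    }
    return value;
}

// Block-level sum; the result is valid in thread 0 only.
// Assumes blockDim.x is a multiple of 32 and at most 1024 (up to 32 warps).
__device__ float block_reduce_sum(float value) {
    __shared__ float warp_sums[32];
    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;
    int num_warps = blockDim.x / warpSize;

    value = warp_reduce_sum(value);       // stage 1: reduce inside each warp
    if (lane == 0) {
        warp_sums[warp] = value;          // one partial sum per warp
    }
    __syncthreads();                      // make the partials visible

    if (warp == 0) {                      // stage 2: first warp reduces partials
        value = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        value = warp_reduce_sum(value);
    }
    return value;
}

// Hypothetical kernel: a single block sums n inputs.
__global__ void block_sum_kernel(const float* in, float* out, int n) {
    float v = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        v += in[i];
    }
    float total = block_reduce_sum(v);
    if (threadIdx.x == 0) {
        *out = total;
    }
}

// Hypothetical host helper; error checking omitted in this sketch.
float block_sum_host(const float* h_in, int n) {
    float* d_in = nullptr;
    float* d_out = nullptr;
    float result = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    block_sum_kernel<<<1, 256>>>(d_in, d_out, n);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Shared memory is touched only once per warp here, which is the main reason to prefer shuffles for the intra-warp stages.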