Memory Coalescing

cudamemorycoalescingwarp

CUDA kernel이 느린 가장 흔한 이유 중 하나는 계산보다 memory access가 비효율적이기 때문이다.

핵심 직관

warp 안 thread들이 인접한 주소를 읽으면 GPU는 이 요청을 더 적은 memory transaction으로 묶기 쉽다.

// coalesced
float x = input[base + threadIdx.x];

// strided
float x = input[base + threadIdx.x * stride];

둘 다 thread 하나가 float 하나를 읽는다. 하지만 hardware 입장에서는 두 번째가 더 많은 memory segment를 건드릴 수 있다.

transpose에서 자주 만난다

matrix transpose는 읽기는 연속인데 쓰기가 stride가 되거나, 반대로 쓰기는 연속인데 읽기가 stride가 되는 형태가 쉽게 나온다. 그래서 shared memory tiling으로 접근 패턴을 바꾼다.

확인

coalescing은 thread 개수 문제가 아니라 어떤 주소를 함께 읽는가의 문제다.
row-major 배열에서 A[row * width + col]은 col이 연속될 때 연속 주소가 된다.
warp 단위로 주소 패턴을 보는 습관이 필요하다.