Matrix Transpose Kernel

A transpose swaps the rows and columns of a matrix.

in[row, col] -> out[col, row]

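The key point is that the read index and the write index differ. With row-major storage, in has shape [rows, cols] and out has shape [cols, rows]: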

int in_idx = row * cols + col;
int out_idx = col * rows + row;
out[out_idx] = in[in_idx];
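
As a quick sanity check of these two formulas, the following host-side C++ sketch (it should also compile inside a .cu file) prints the index mapping for a small matrix. The helper names in_index, out_index, and print_mapping are hypothetical, introduced only for illustration, and row-major storage is assumed.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: flat index of in[row, col], where in is [rows, cols] row-major.
inline int in_index(int row, int col, int cols) {
    return row * cols + col;
}

// Hypothetical helper: flat index of out[col, row], where out is [cols, rows] row-major.
inline int out_index(int row, int col, int rows) {
    return col * rows + row;
}

// Print the read index -> write index mapping for every element.
inline void print_mapping(int rows, int cols) {
    assert(rows > 0 && cols > 0);
    for (int row = 0; row < rows; ++row) {
        for (int col = 0; col < cols; ++col) {
            std::printf("in[%d,%d] idx %d -> out[%d,%d] idx %d\n",
                        row, col, in_index(row, col, cols),
                        col, row, out_index(row, col, rows));
        }
    }
}
```

Calling print_mapping(2, 3) lists six pairs. For example, in[0,1] has flat index 1 and lands at out[1,0], flat index 2.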

Why it matters

In matrix add, the read and the write used the same coordinates.

C[row, col] = A[row, col] + B[row, col]

In a transpose, the location that is read and the location that is written diverge.

read:  in[row, col]
write: out[col, row]

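This later leads to the memory coalescing problem. A warp's memory access pattern depends on the direction in which its threads read and the direction in which they write. With threadIdx.x mapped to col, neighboring threads read neighboring addresses but write to addresses that are rows elements apart.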
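
To make the two access directions concrete, here is a small host-side C++ sketch that lists the flat indices touched by a few threads with consecutive col values. The function names read_indices and write_indices are hypothetical, and the sketch assumes threadIdx.x is mapped to col, as in the naive kernel.

```cpp
#include <cassert>
#include <vector>

// Flat read indices for n threads with consecutive col values in one row:
// in[row * cols + col].
inline std::vector<int> read_indices(int row, int first_col, int n, int cols) {
    assert(n >= 0 && cols > 0);
    std::vector<int> idx;
    for (int t = 0; t < n; ++t) {
        idx.push_back(row * cols + (first_col + t));
    }
    return idx;
}

// Flat write indices for the same threads: out[col * rows + row].
inline std::vector<int> write_indices(int row, int first_col, int n, int rows) {
    assert(n >= 0 && rows > 0);
    std::vector<int> idx;
    for (int t = 0; t < n; ++t) {
        idx.push_back((first_col + t) * rows + row);
    }
    return idx;
}
```

For a 1024 x 1024 matrix, the first four threads of row 0 read indices 0, 1, 2, 3 but write indices 0, 1024, 2048, 3072. The reads are contiguous, while the writes are strided by rows.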

Naive kernel

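__global__ void transpose_kernel(
    const float* in,   // input matrix,  shape [rows, cols], row-major
    float* out,        // output matrix, shape [cols, rows], row-major
    int rows,
    int cols
) {
    // x covers columns and y covers rows of the input matrix.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // Threads that fall outside the matrix do nothing.
    if (row < rows && col < cols) {
        out[col * rows + row] = in[row * cols + col];
    }
}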
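
The kernel needs a launch grid that covers every element, and the bounds check handles the leftover threads. Below is a minimal host-side C++ sketch of that sizing arithmetic, assuming a 16 x 16 block. The block size and the names kBlockDim, ceil_div, and count_active_threads are illustrative assumptions, not part of the original kernel.

```cpp
#include <cassert>

// Hypothetical block edge length; 16 x 16 = 256 threads per block.
constexpr int kBlockDim = 16;

// Round-up integer division: number of blocks needed to cover n elements.
inline int ceil_div(int n, int block) {
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}

// Emulate the kernel's guard on the CPU: count the launched threads that
// pass `row < rows && col < cols`.
inline int count_active_threads(int rows, int cols) {
    int grid_x = ceil_div(cols, kBlockDim);  // x dimension covers columns
    int grid_y = ceil_div(rows, kBlockDim);  // y dimension covers rows
    int active = 0;
    for (int row = 0; row < grid_y * kBlockDim; ++row) {
        for (int col = 0; col < grid_x * kBlockDim; ++col) {
            if (row < rows && col < cols) ++active;
        }
    }
    return active;
}
```

For a 2 x 3 matrix this gives a single block of 256 threads, of which only 6 pass the guard. In a .cu file the same numbers would typically feed dim3 block(16, 16) and dim3 grid(ceil_div(cols, 16), ceil_div(rows, 16)) for a launch such as transpose_kernel<<<grid, block>>>(in, out, rows, cols).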

Verification

Check with a small matrix first.

in shape  = [2, 3]
out shape = [3, 2]

Compare the GPU output against a CPU reference.
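
One possible shape for the CPU reference is sketched below in host-side C++. The names transpose_cpu and same_values are hypothetical, and the sketch assumes the GPU result has already been copied back into a std::vector<float> (for example with cudaMemcpy).

```cpp
#include <cassert>
#include <vector>

// CPU reference: out[col, row] = in[row, col], both matrices row-major.
inline std::vector<float> transpose_cpu(const std::vector<float>& in,
                                        int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    assert(static_cast<int>(in.size()) == rows * cols);
    std::vector<float> out(in.size());
    for (int row = 0; row < rows; ++row) {
        for (int col = 0; col < cols; ++col) {
            out[col * rows + row] = in[row * cols + col];
        }
    }
    return out;
}

// A transpose only copies values, so exact equality is enough here
// (NaN inputs would need a bitwise comparison instead).
inline bool same_values(const std::vector<float>& expected,
                        const std::vector<float>& actual) {
    return expected == actual;
}
```

With in = {1, 2, 3, 4, 5, 6} and shape [2, 3], transpose_cpu returns {1, 4, 2, 5, 3, 6}, which is the [3, 2] result the GPU output should match.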

Check

What is the stride in in[row * cols + col]?
Why does out[col * rows + row] multiply by rows?
Why is transpose harder than matrix add in terms of indexing?
