CPU Reference and the Verification Loop

When you write a CUDA kernel, the first goal is correct code, not fast code.

CPU reference
GPU kernel
elementwise compare
max error / tolerance

With this sequence in place, you can tell immediately whether a later optimization has broken the result.

CPU reference

The CPU reference is allowed to be slow. What matters is that it is easy to read and that you can trust it.

// Reference implementation: a plain sequential loop with no optimizations.
void add_cpu(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

Whenever you write a new CUDA kernel, first implement the same operation as a CPU loop.

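Floating-point results from the CPU and the GPU are not guaranteed to be bit-identical, for example when a multiply and an add are fused into a single FMA or when a kernel accumulates values in a different order. For that reason the comparison usually allows a tolerance.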

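float diff = fabsf(cpu_value - gpu_value);  // fabsf is the float version of fabs (<math.h>)
if (!(diff <= tolerance)) {                 // written this way so a NaN diff also counts
    // mismatch
}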
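One reason a tolerance is needed is that floating-point addition is not associative, so combining the same values in a different order can round differently. The sketch below illustrates this on the host only; `sum_left_to_right` and `sum_right_to_left` are hypothetical helpers, and the second one merely stands in for a kernel that accumulates in another order.

```cpp
// Accumulate front to back, the way a simple CPU reference loop would.
float sum_left_to_right(const float* v, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += v[i];
    return s;
}

// Accumulate back to front. This is only a stand-in for a kernel that
// combines elements in a different order; no GPU code runs here.
float sum_right_to_left(const float* v, int n) {
    float s = 0.0f;
    for (int i = n - 1; i >= 0; i--) s += v[i];
    return s;
}
```

With the deliberately extreme input `{1e8f, -1e8f, 1.0f}`, the first function returns `1.0f` and the second returns `0.0f` under IEEE-754 single precision, assuming the compiler is not allowed to reorder the additions (no fast-math style flags). Real kernels usually differ by far less, but the effect is the same kind.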

Start with a small shape and print the values directly; then move to a large shape and check mismatches across the whole output.

Small shape:
  check row/col/index by eye

Large shape:
  check max absolute error and mismatch count
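The two checks above can be sketched as host-side helpers. This is a minimal sketch under assumptions: `CompareResult`, `compare_arrays`, and `print_small` are hypothetical names, the GPU output is assumed to have been copied back into a host buffer already, and the small-shape printer assumes a row-major layout.

```cpp
#include <cmath>
#include <cstdio>

struct CompareResult {
    float max_abs_error;  // largest |ref - out|; NaN differences are skipped here
    long mismatch_count;  // elements whose difference exceeds the tolerance or is NaN
    long first_mismatch;  // index of the first mismatch, or -1 if there is none
};

// Large shape: scan every element and summarize instead of printing.
CompareResult compare_arrays(const float* ref, const float* out, long n, float tolerance) {
    CompareResult r = {0.0f, 0, -1};
    for (long i = 0; i < n; i++) {
        float diff = std::fabs(ref[i] - out[i]);
        if (diff > r.max_abs_error) r.max_abs_error = diff;  // false for NaN
        if (!(diff <= tolerance)) {                          // true for NaN
            if (r.first_mismatch < 0) r.first_mismatch = i;
            r.mismatch_count++;
        }
    }
    return r;
}

// Small shape: print row, column, flat index, and both values for inspection.
// Assumes row-major storage, i.e. index = row * cols + col.
void print_small(const float* ref, const float* out, int rows, int cols) {
    for (int row = 0; row < rows; row++) {
        for (int col = 0; col < cols; col++) {
            int i = row * cols + col;
            std::printf("[%d,%d] i=%d ref=%g out=%g\n", row, col, i, ref[i], out[i]);
        }
    }
}
```

Reporting the first mismatching index alongside the count tends to help, because it points at a concrete element to inspect with the small-shape printer.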

Once you move on to a PyTorch extension, the reference can be built with PyTorch itself.

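import torch

ref = torch_op(x)          # trusted PyTorch implementation (placeholder name)
out = custom_cuda_op(x)    # custom extension under test (placeholder name)
torch.testing.assert_close(out, ref, rtol=1e-4, atol=1e-4)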
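As a rough guide to what `rtol` and `atol` mean, `assert_close` accepts a pair of values when |out - ref| <= atol + rtol * |ref|. The sketch below writes that criterion for a single pair in C++ so it can be reused in a plain CUDA test harness. `is_close` is a hypothetical helper, not a PyTorch API, and its handling of NaN and infinity is a simplification chosen for this sketch.

```cpp
#include <cmath>

// Combined absolute/relative tolerance check for one pair of values.
bool is_close(float out, float ref, float rtol, float atol) {
    // NaN is never considered close in this sketch.
    if (std::isnan(out) || std::isnan(ref)) return false;
    // Infinities must match exactly; otherwise the bound below would be infinite.
    if (std::isinf(out) || std::isinf(ref)) return out == ref;
    return std::fabs(out - ref) <= atol + rtol * std::fabs(ref);
}
```

The relative term scales the allowed error with the magnitude of the reference, while the absolute term keeps the check meaningful when the reference is at or near zero.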