CUDA vs Vulkan performance difference


I asked the same question in the CUDA forum but was redirected here. The original question includes the CUDA kernel; here is the GLSL shader for completeness:

void main() {
    uint row = gl_GlobalInvocationID.x;
    uint col = gl_GlobalInvocationID.y;

    uint offset = N * row;

    float result = 0.0f;

    for (uint s = 0; s < N; ++s)
      result += a[offset + s] * b[col + s * N];

    c[offset + col] = result;
}

To understand the difference, I used the VK_KHR_shader_clock extension to time individual threads with clockRealtimeEXT() and printed the result with debugPrintfEXT(). I did the same in the CUDA kernel with clock64() and printf. I am not sure this comparison is valid, because these functions return time in clock cycles, and in Vulkan there is no way to convert that to seconds. Since both versions run on the same hardware, though, I expect the numbers to be comparable. The measurements show a gap similar to the earlier one: on a Quadro P1000, for example, I get ~1845248 for Vulkan and ~568672 for CUDA, which is roughly the same 3x difference that the other timers measured.
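A minimal sketch of how the per-thread timing described above might look in GLSL. It assumes the GLSL extensions GL_EXT_shader_realtime_clock (for clockRealtimeEXT()), GL_ARB_gpu_shader_int64 (for the 64-bit return type) and GL_EXT_debug_printf are enabled, and that the elapsed tick count fits in 32 bits. The workgroup size and the placeholder for the timed work are illustrative and not taken from the original shader.

```glsl
#version 450
#extension GL_EXT_shader_realtime_clock : require
#extension GL_ARB_gpu_shader_int64 : require
#extension GL_EXT_debug_printf : require

// Assumed workgroup size, for illustration only.
layout(local_size_x = 16, local_size_y = 16) in;

void main() {
    uint64_t start = clockRealtimeEXT();

    // ... the work being timed goes here (e.g. the matrix-multiply loop) ...

    uint64_t end = clockRealtimeEXT();

    // Print from a single invocation to keep the output readable.
    // The difference is narrowed to 32 bits, which assumes a short interval.
    if (gl_GlobalInvocationID.xy == uvec2(0))
        debugPrintfEXT("elapsed ticks = %u\n", uint(end - start));
}
```

The printed value is a raw tick count, so it is only meaningful when compared against another measurement taken the same way on the same device.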

As @Robert_Crovella pointed out, the mistake was mine. I had swapped the row and column indices in the Vulkan shader relative to the CUDA kernel:

// Vulkan
uint row = gl_GlobalInvocationID.x;
uint col = gl_GlobalInvocationID.y;

// Cuda
uint32_t const row{blockIdx.y * blockDim.y + threadIdx.y};
uint32_t const col{blockIdx.x * blockDim.x + threadIdx.x};

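As a reminder, gl_GlobalInvocationID.x is equal to gl_WorkGroupID.x * gl_WorkGroupSize.x + gl_LocalInvocationID.x. The indexing was therefore reversed in Vulkan relative to CUDA: adjacent invocations along x handled different rows rather than adjacent columns, which greatly degraded the memory access pattern.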
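For reference, here is a sketch of the shader with the indices swapped to match the CUDA kernel. Only the body of main() comes from the original post. The buffer declarations, binding numbers, 16x16 workgroup size, push-constant N and bounds check are assumptions added so the example is complete. With col taken from the x component, neighbouring invocations along x read consecutive elements of b and write consecutive elements of c, instead of addresses N floats apart.

```glsl
#version 450

// Assumed workgroup size and bindings; the original post does not show them.
layout(local_size_x = 16, local_size_y = 16) in;

layout(std430, set = 0, binding = 0) readonly buffer A { float a[]; };
layout(std430, set = 0, binding = 1) readonly buffer B { float b[]; };
layout(std430, set = 0, binding = 2) writeonly buffer C { float c[]; };

// Assumed: the matrix dimension is passed as a push constant.
layout(push_constant) uniform Params { uint N; };

void main() {
    // Map x to the column and y to the row, as the CUDA kernel does.
    uint col = gl_GlobalInvocationID.x;
    uint row = gl_GlobalInvocationID.y;

    // Skip invocations outside the matrix when N is not a multiple of the workgroup size.
    if (row >= N || col >= N)
        return;

    uint offset = N * row;

    float result = 0.0f;

    for (uint s = 0; s < N; ++s)
        result += a[offset + s] * b[col + s * N];

    c[offset + col] = result;
}
```

If the dispatch dimensions were chosen for the old mapping, they would need to be swapped as well, so that the x dimension covers the columns.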
