I asked the same question in the CUDA forum but was redirected here. The original question has the CUDA kernel, and here’s the GLSL shader for completeness:
void main() {
    // One invocation computes one element of c = a * b (row-major, N x N).
    uint row = gl_GlobalInvocationID.x;
    uint col = gl_GlobalInvocationID.y;
    uint offset = N * row;
    float result = 0.0f;
    for (uint s = 0; s < N; ++s)
        result += a[offset + s] * b[col + s * N];
    c[offset + col] = result;
}
In an attempt to understand the difference, I used the VK_KHR_shader_clock extension to time individual threads (with clockRealtimeEXT()) and print the results (with debugPrintfEXT()). I did the same in the CUDA kernel (with clock64() and printf). I am not sure this is a valid comparison, because these functions return time in clock cycles, and in Vulkan there is no way to translate that into seconds; but since both run on the same hardware, I expect the numbers to be comparable. The measurements show a difference similar to before: for example, on a Quadro P1000 I get ~1845248 for Vulkan and ~568672 for CUDA, which is approximately the same 3x difference that I measured with other timers.
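
For reference, the per-thread timing described above could look roughly like the sketch below. It is not the exact shader I ran. The workgroup size, the buffer bindings, and passing N as a push constant are assumptions made only so the example is self-contained. It uses the 2x32 variant of the clock function to avoid needing 64-bit integer support, and it assumes the elapsed tick count fits in 32 bits.

```glsl
#version 450
#extension GL_EXT_shader_realtime_clock : require
#extension GL_EXT_debug_printf : require

// Assumed layout: workgroup size, bindings and the push constant are illustrative.
layout(local_size_x = 16, local_size_y = 16) in;
layout(std430, set = 0, binding = 0) readonly  buffer BufA { float a[]; };
layout(std430, set = 0, binding = 1) readonly  buffer BufB { float b[]; };
layout(std430, set = 0, binding = 2) writeonly buffer BufC { float c[]; };
layout(push_constant) uniform Params { uint N; };

void main() {
    // .x holds the low 32 bits of the counter, .y the high 32 bits.
    uvec2 start = clockRealtime2x32EXT();

    uint row = gl_GlobalInvocationID.x;
    uint col = gl_GlobalInvocationID.y;
    uint offset = N * row;
    float result = 0.0f;
    for (uint s = 0; s < N; ++s)
        result += a[offset + s] * b[col + s * N];
    c[offset + col] = result;

    uvec2 end = clockRealtime2x32EXT();

    // Unsigned wrap-around makes the low-word difference correct
    // as long as the real elapsed count is below 2^32.
    uint elapsed = end.x - start.x;

    // Print from a single invocation to avoid flooding the debug output.
    if (gl_GlobalInvocationID.x == 0u && gl_GlobalInvocationID.y == 0u)
        debugPrintfEXT("elapsed ticks: %u", elapsed);
}
```

The units of the realtime clock are implementation-defined, and the compiler may move the clock reads relative to the loop, so the printed value is probably best treated as a rough indication rather than an exact cycle count.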