CUDA vs Vulkan performance difference

Hello,

I am porting a CUDA kernel to Vulkan and hit a big difference in performance. I have a very simple brute-force kernel for matrix multiplication:

// Brute-force multiply: one thread per output element.
// ma, mb, mc are flat row-major N x N matrices.
__global__ void kernel(float const* ma, float const* mb, float* mc, uint32_t const N)
{
  uint32_t const row{blockIdx.y * blockDim.y + threadIdx.y};
  uint32_t const col{blockIdx.x * blockDim.x + threadIdx.x};

  uint32_t const offset{N * row};  // start of this thread's row

  float result{0.0f};

  for (uint32_t s{0}; s < N; ++s)
  {
    result += ma[offset + s] * mb[col + s * N];
  }

  mc[offset + col] = result;
}

Yes, I know that this is not the best way to multiply two matrices together, but it serves as a good illustration.
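For reference, the per-thread arithmetic above is just a row-major matrix multiply. Here is a minimal plain-Python sketch of the same index math (the flat lists `ma`, `mb`, `mc` and the size `N` stand in for the kernel's buffers; `matmul_flat` is my own helper name, and `N` is tiny here only to check the indexing):

```python
# Plain-Python model of the kernel's index arithmetic:
# ma, mb are flat row-major N x N matrices, one "thread" per (row, col).
def matmul_flat(ma, mb, N):
    mc = [0.0] * (N * N)
    for row in range(N):
        offset = N * row          # start of row `row` in ma and mc
        for col in range(N):
            result = 0.0
            for s in range(N):
                # same expression as the kernel body
                result += ma[offset + s] * mb[col + s * N]
            mc[offset + col] = result
    return mc

# Quick check against a hand-computed 2x2 product
a = [1.0, 2.0,
     3.0, 4.0]
b = [5.0, 6.0,
     7.0, 8.0]
print(matmul_flat(a, b, 2))  # [19.0, 22.0, 43.0, 50.0]
```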

On the Vulkan side, I have exactly the same shader written in GLSL, with identical block/group sizes and the same number of blocks/groups. I run the GPU code multiple times, synchronizing after each call: cudaDeviceSynchronize for CUDA and device_wait_idle for Vulkan. I measure performance by recording the time before the kernel/dispatch call and after the synchronization. Additionally, I use timestamp queries in the Vulkan code and profile the CUDA kernel with ncu (I can't make Nsight Graphics work with Vulkan). On a Quadro P1000 I get ~280 ms for CUDA vs ~970 ms for Vulkan; these are the times spent in the kernel/shader. I also ran the code on an Orin board: ~160 ms vs ~220 ms.

In CUDA, I am using managed memory for the input and output matrices. In Vulkan, I am using device-local memory, to which I copy the data before starting to measure. The matrices are 2048x2048.

Without profiling (as mentioned, I can't use the tool with Vulkan), it's hard for me to say why there's such a big difference. I expect the numbers to be the same, because blocks and groups are the same concept and map onto the actual hardware in the same way. I've heard that NVIDIA puts a lot of effort into CUDA by optimizing the compiler and the driver, but in my case there's basically nothing to optimize: the code is very simple. What else could be the reason, and what else could I try?

You might wish to ask this on the Vulkan forum. Here is an example of a question on the Vulkan forum. I'm not sure if there will be any Vulkan experts here.

Thank you for the suggestion. I duplicated the question here.

Why did you reverse the sense of x and y between your row and col variables in the shader and CUDA versions? That kind of thing matters for performance, if nothing else.

In CUDA, for performance, we usually want to associate row indexing with the y grid variables and column indexing with the x grid variables, so that adjacent threads (adjacent x) access adjacent memory locations and the accesses coalesce. I wouldn't be surprised if a shader has a similar sensitivity.
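To illustrate the coalescing point with the kernel's own layout: within a warp, consecutive threads differ in threadIdx.x. With col tied to x, neighboring threads write neighboring elements of mc; with the axes swapped, they touch addresses a full row apart. A small Python sketch of the resulting byte offsets (assuming 4-byte floats and the flat row-major layout from the post; `store_offset` is my own helper name):

```python
N = 2048            # matrix dimension from the post
FLOAT_SIZE = 4      # bytes per float

def store_offset(row, col):
    # byte offset of the kernel's store: mc[N * row + col]
    return (N * row + col) * FLOAT_SIZE

# Correct mapping: col comes from threadIdx.x, so thread i in a warp
# has col = i and a fixed row. Consecutive threads -> consecutive floats.
good = [store_offset(0, tid) for tid in range(4)]
print(good)   # [0, 4, 8, 12] -> contiguous, coalesces into few transactions

# Swapped mapping: row comes from threadIdx.x instead. Consecutive
# threads now write addresses N floats apart.
bad = [store_offset(tid, 0) for tid in range(4)]
print(bad)    # [0, 8192, 16384, 24576] -> strided, one transaction per thread
```

The loads from mb follow the same pattern, which is why the swapped version pays the penalty on every iteration of the inner loop, not just on the final store.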


Wow! You are my savior, Robert. It was a typo that I couldn’t spot by myself. Thank you very much!

After fixing the indexing, the performance became identical in both APIs.
