Hello,
I am porting a CUDA kernel to Vulkan and have hit a big performance difference. I have a very simple brute-force kernel for matrix multiplication:
__global__ void kernel(float const* ma, float const* mb, float* mc, uint32_t N)
{
    uint32_t const row{blockIdx.y * blockDim.y + threadIdx.y};
    uint32_t const col{blockIdx.x * blockDim.x + threadIdx.x};
    uint32_t const offset{N * row};
    float result{0.0f};
    for (uint32_t s{0}; s < N; ++s)
    {
        result += ma[offset + s] * mb[col + s * N];
    }
    mc[offset + col] = result;
}
Yes, I know that this is not the best way to multiply two matrices together, but it serves as a good illustration.
On the Vulkan side, I have exactly the same shader written in GLSL, with identical block/workgroup sizes and the same number of blocks/workgroups. I run the GPU code multiple times, synchronizing after each call: cudaDeviceSynchronize for CUDA and device_wait_idle for Vulkan. I measure performance by recording the time before the kernel/dispatch call and after the synchronization. Additionally, I use timestamp queries in the Vulkan code and profile the CUDA kernel with ncu (I can't make Nsight Graphics work with Vulkan). On a Quadro P1000 I get ~280 ms for CUDA vs ~970 ms for Vulkan - these are the times spent in the kernel/shader. I also ran the code on an Orin board: ~160 ms vs ~220 ms.
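For concreteness, a GLSL compute shader equivalent to the CUDA kernel above might look like the sketch below. The workgroup size, binding numbers, and the push constant carrying N are my assumptions, not necessarily what my real code uses:

#version 450
layout(local_size_x = 16, local_size_y = 16) in;

layout(std430, binding = 0) readonly buffer MA { float ma[]; };
layout(std430, binding = 1) readonly buffer MB { float mb[]; };
layout(std430, binding = 2) writeonly buffer MC { float mc[]; };

layout(push_constant) uniform Push { uint N; };

void main()
{
    const uint row = gl_GlobalInvocationID.y;
    const uint col = gl_GlobalInvocationID.x;
    const uint offset = N * row;
    float result = 0.0;
    for (uint s = 0u; s < N; ++s)
    {
        result += ma[offset + s] * mb[col + s * N];
    }
    mc[offset + col] = result;
}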
In CUDA, I use managed memory for the input and output matrices. In Vulkan, I use device-local memory, into which I copy the data before starting to measure. The matrices are 2048x2048.
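The CUDA side of the measurement can be sketched roughly as follows (assuming the kernel takes its buffers and N as parameters; the 16x16 block size and the repeat count are illustrative, and error checking is omitted):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    constexpr uint32_t N{2048};
    float *ma, *mb, *mc;
    // Managed memory for inputs and output, as described above.
    cudaMallocManaged(&ma, N * N * sizeof(float));
    cudaMallocManaged(&mb, N * N * sizeof(float));
    cudaMallocManaged(&mc, N * N * sizeof(float));

    dim3 const block{16, 16};                  // assumed block size
    dim3 const grid{N / block.x, N / block.y}; // 2048 is a multiple of 16

    auto const start{std::chrono::steady_clock::now()};
    for (int i{0}; i < 10; ++i)
    {
        kernel<<<grid, block>>>(ma, mb, mc, N);
        cudaDeviceSynchronize();               // sync before reading the clock
    }
    auto const stop{std::chrono::steady_clock::now()};

    auto const ms{std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()};
    std::printf("total: %lld ms\n", static_cast<long long>(ms));
}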
Without profiling (as mentioned, I can't use the tool with Vulkan), it's hard for me to say where such a big difference comes from. I expected the numbers to be roughly the same, because blocks and workgroups are the same concept and map onto the hardware in the same way. I've heard that NVIDIA puts a lot of effort into optimizing the CUDA compiler and driver, but in my case there is basically nothing to optimize - the code is very simple. What else could be the reason, and what else could I try?