CUDA vs Vulkan - performance issue (possibly __syncwarp related)

I am currently porting a CUDA algorithm to Vulkan.

The port is complete from a functional point of view - the Vulkan version produces exactly the same output as the CUDA version, given the same input.

The problem is performance - Vulkan is about 2-4 times slower than CUDA, depending on the NVIDIA GPU I am running the tests on.

I am aware that there may be many reasons why Vulkan is slower, but first I would like to focus on a single symptom that may point to the root cause of the problem.

The point is that the CUDA kernels used by the algorithm require __syncwarp calls in certain places - if I comment these calls out, the algorithm stops working and hangs.
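To illustrate the kind of dependency I mean, here is a made-up minimal example (not one of my actual kernels) of a pattern that needs __syncwarp:

    // Made-up minimal example, assuming a block of exactly 32 threads (one warp).
    // Since Volta, the threads of a warp can diverge independently, so
    // __syncwarp() is needed both to reconverge the lanes and to make the
    // shared-memory writes visible across the whole warp.
    __global__ void warpExchange(int* out)
    {
        __shared__ int slot[32];
        const unsigned lane = threadIdx.x % 32;

        slot[lane] = static_cast<int>(lane) * 2;  // every lane publishes a value
        __syncwarp();                             // without this, lane 0 may read stale data

        if (lane == 0)
        {
            int sum = 0;
            for (int i = 0; i < 32; ++i)
                sum += slot[i];
            out[blockIdx.x] = sum;
        }
    }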

I rewrote the CUDA kernels in HLSL and use dxc to compile the HLSL to SPIR-V so Vulkan can consume it. But I haven't included the __syncwarp counterparts in the HLSL code yet (because I haven't figured out how to do that yet).

BUT this doesn't prevent the Vulkan version from working properly! It looks like the driver runs Vulkan shaders in a mode that makes those syncs unnecessary. If that is indeed the case, it seems obvious that such a mode would also hurt performance.

Any idea how to make CUDA and Vulkan behave equivalently with regard to syncing?

Hi there @zoku, welcome back to the NVIDIA developer forums.

Sorry, I can’t answer your question regarding syncing, but I am curious as to the purpose of this?

CUDA and Vulkan are two completely different things.

CUDA is a parallel computing platform and programming model.

Vulkan is a new generation graphics and compute API that provides high-efficiency, cross-platform access to modern GPUs.

CUDA is not an API, and trying to “port” something from CUDA to Vulkan will only work for a very small number of use cases. And I am not surprised that going through HLSL/SPIR-V instead of executing CUDA kernels directly adds a performance hit.

You should rather use CUDA interop within Vulkan if you want to do complex parallel GPU computations without performance loss while still working with Vulkan.

Thank you for your response!

The point is that Vulkan is also a parallel computing platform. Sure, there may be CUDA features that are not available in Vulkan, but my CUDA algorithm is relatively simple and I think all the features it uses have a Vulkan counterpart.

The SPIR-V to native kernel code compilation is done only once (at the start of the application). After that, there should be no performance difference between native CUDA kernels and kernels compiled from SPIR-V, unless the driver intentionally skips some optimizations in the Vulkan path.
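Just to be explicit about what I mean by the one-time step - it is the usual compute pipeline creation, roughly like this (simplified sketch with placeholder variables, no error handling):

    #include <vulkan/vulkan.h>

    // Simplified sketch: the SPIR-V blob produced by dxc is handed to the
    // driver once, and the SPIR-V -> native ISA compilation happens at
    // pipeline creation. Later vkCmdDispatch calls only reuse the
    // already-compiled pipeline.
    VkPipeline createComputePipelineOnce(VkDevice device, VkPipelineLayout layout,
                                         const uint32_t* spirvWords, size_t spirvSizeInBytes)
    {
        VkShaderModuleCreateInfo moduleInfo{ VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO };
        moduleInfo.codeSize = spirvSizeInBytes;   // size of the dxc output in bytes
        moduleInfo.pCode    = spirvWords;         // pointer to the SPIR-V words
        VkShaderModule shaderModule = VK_NULL_HANDLE;
        vkCreateShaderModule(device, &moduleInfo, nullptr, &shaderModule);

        VkComputePipelineCreateInfo pipelineInfo{ VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO };
        pipelineInfo.stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
        pipelineInfo.stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
        pipelineInfo.stage.module = shaderModule;
        pipelineInfo.stage.pName  = "main";       // entry point used when compiling with dxc
        pipelineInfo.layout       = layout;       // pipeline layout created elsewhere

        VkPipeline pipeline = VK_NULL_HANDLE;
        vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &pipelineInfo, nullptr, &pipeline);
        return pipeline;
    }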

Hi @zoku,

Thank you for bringing this up.

While it is true that you can do parallel computing in Vulkan, the Vulkan API has different requirements than what CUDA documents. The synchronization and memory models are very different between the two, and even porting a simple algorithm from one to the other can lead to wildly different compiled code being fed to the actual GPU.

I haven't included the __syncwarp counterparts in the HLSL code yet (because I haven't figured out how to do that yet).

There doesn’t seem to be an HLSL equivalent to __syncwarp. However, HLSL allows you to inline SPIR-V, so you could inline a barrier instruction at the subgroup scope, which should achieve the same thing.

Regarding your original question, it is impossible to say whether what you are experiencing is expected or not without more details. The Vulkan and CUDA environments are just too different. It would be even more helpful if you could provide a minimal application which exhibits the problem.

Thank you all for your responses.

I was able to significantly improve the performance of the Vulkan variant - it is still not as fast as CUDA, but it is much faster than it originally was. Let me explain.

My algorithm is based on a number of work segments. Every segment looks like this:

  • send data CPU → GPU

  • run two compute kernels

  • read data GPU → CPU

I assumed that I would get a performance benefit by running both transfers on a separate transfer queue (a queue with only the transfer bit set) and the compute work on a compute queue.

So I used both queues to execute a segment as described above. I synchronized the two queues using timeline semaphores (GPU-side syncing), with a single CPU wait at the very end of the segment (also on a timeline semaphore).

Using two queues of course means more submit operations, and I am aware that submits can be expensive. I assumed that wouldn't be a problem, since I created a separate thread for each queue to perform the submits asynchronously to the main thread.
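Roughly, the per-segment submission looked like this (a heavily simplified sketch with placeholder names, written in the Vulkan 1.3 / synchronization2 style; not my actual code):

    #include <vulkan/vulkan.h>

    // Heavily simplified sketch of the two-queue variant. One timeline semaphore
    // orders the three submits:
    //   transfer queue : upload                        -> signal base+1
    //   compute queue  : wait base+1, two dispatches   -> signal base+2
    //   transfer queue : wait base+2, readback         -> signal base+3
    // The CPU then waits once, on base+3. (In the real implementation the
    // submits happen on per-queue worker threads.)
    void submitSegmentTwoQueues(VkDevice device, VkQueue transferQueue, VkQueue computeQueue,
                                VkCommandBuffer uploadCmd, VkCommandBuffer computeCmd,
                                VkCommandBuffer readbackCmd, VkSemaphore timelineSemaphore,
                                uint64_t base)
    {
        auto semInfo = [&](uint64_t value, VkPipelineStageFlags2 stage) {
            VkSemaphoreSubmitInfo info{ VK_STRUCTURE_TYPE_SEMAPHORE_SUBMIT_INFO };
            info.semaphore = timelineSemaphore;  // created with VK_SEMAPHORE_TYPE_TIMELINE
            info.value     = value;
            info.stageMask = stage;
            return info;
        };
        auto cmdInfo = [](VkCommandBuffer cb) {
            VkCommandBufferSubmitInfo info{ VK_STRUCTURE_TYPE_COMMAND_BUFFER_SUBMIT_INFO };
            info.commandBuffer = cb;
            return info;
        };

        // 1) Upload on the transfer queue.
        VkSemaphoreSubmitInfo uploadSignal = semInfo(base + 1, VK_PIPELINE_STAGE_2_TRANSFER_BIT);
        VkCommandBufferSubmitInfo uploadCb = cmdInfo(uploadCmd);
        VkSubmitInfo2 uploadSubmit{ VK_STRUCTURE_TYPE_SUBMIT_INFO_2 };
        uploadSubmit.commandBufferInfoCount   = 1; uploadSubmit.pCommandBufferInfos   = &uploadCb;
        uploadSubmit.signalSemaphoreInfoCount = 1; uploadSubmit.pSignalSemaphoreInfos = &uploadSignal;
        vkQueueSubmit2(transferQueue, 1, &uploadSubmit, VK_NULL_HANDLE);

        // 2) Both compute dispatches on the compute queue, ordered after the upload.
        VkSemaphoreSubmitInfo computeWait   = semInfo(base + 1, VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT);
        VkSemaphoreSubmitInfo computeSignal = semInfo(base + 2, VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT);
        VkCommandBufferSubmitInfo computeCb = cmdInfo(computeCmd);  // contains both dispatches
        VkSubmitInfo2 computeSubmit{ VK_STRUCTURE_TYPE_SUBMIT_INFO_2 };
        computeSubmit.waitSemaphoreInfoCount   = 1; computeSubmit.pWaitSemaphoreInfos   = &computeWait;
        computeSubmit.commandBufferInfoCount   = 1; computeSubmit.pCommandBufferInfos   = &computeCb;
        computeSubmit.signalSemaphoreInfoCount = 1; computeSubmit.pSignalSemaphoreInfos = &computeSignal;
        vkQueueSubmit2(computeQueue, 1, &computeSubmit, VK_NULL_HANDLE);

        // 3) Readback on the transfer queue, ordered after the compute work.
        VkSemaphoreSubmitInfo readbackWait   = semInfo(base + 2, VK_PIPELINE_STAGE_2_TRANSFER_BIT);
        VkSemaphoreSubmitInfo readbackSignal = semInfo(base + 3, VK_PIPELINE_STAGE_2_TRANSFER_BIT);
        VkCommandBufferSubmitInfo readbackCb = cmdInfo(readbackCmd);
        VkSubmitInfo2 readbackSubmit{ VK_STRUCTURE_TYPE_SUBMIT_INFO_2 };
        readbackSubmit.waitSemaphoreInfoCount   = 1; readbackSubmit.pWaitSemaphoreInfos   = &readbackWait;
        readbackSubmit.commandBufferInfoCount   = 1; readbackSubmit.pCommandBufferInfos   = &readbackCb;
        readbackSubmit.signalSemaphoreInfoCount = 1; readbackSubmit.pSignalSemaphoreInfos = &readbackSignal;
        vkQueueSubmit2(transferQueue, 1, &readbackSubmit, VK_NULL_HANDLE);

        // 4) Single CPU wait at the end of the segment.
        uint64_t waitValue = base + 3;
        VkSemaphoreWaitInfo waitInfo{ VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO };
        waitInfo.semaphoreCount = 1;
        waitInfo.pSemaphores    = &timelineSemaphore;
        waitInfo.pValues        = &waitValue;
        vkWaitSemaphores(device, &waitInfo, UINT64_MAX);
    }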

That's the theory. And here is how it looked when I inspected the execution of a single segment in Nsight Graphics:

Green - memory transfers
Orange/yellow - compute execution
Red - waiting on semaphores

If I am interpreting the graph correctly, it looks like waiting on the semaphores alone introduced significant overhead. Remember that this is 100% GPU-side sync - no CPU is involved after the workload has been submitted to the GPU. I was shocked to see how large the overhead was ;)

I reworked the implementation to use a single queue and a single submit per work segment. There is no need for cross-queue syncing now - only pipeline barriers within a single queue. It is much faster now.
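The reworked segment is recorded into one command buffer, roughly like this (again a simplified sketch with placeholder names; descriptor set binding and error handling omitted):

    #include <vulkan/vulkan.h>

    // Simplified sketch of the single-queue segment. Everything is recorded
    // into one command buffer, and ordering inside the queue is expressed with
    // global memory barriers instead of semaphores.
    void recordSegment(VkCommandBuffer cmd,
                       VkBuffer stagingUpload, VkBuffer deviceBuffer, VkBuffer stagingReadback,
                       VkDeviceSize size, VkPipeline kernelA, VkPipeline kernelB,
                       uint32_t groupsA, uint32_t groupsB)
    {
        auto barrier = [&](VkPipelineStageFlags2 srcStage, VkAccessFlags2 srcAccess,
                           VkPipelineStageFlags2 dstStage, VkAccessFlags2 dstAccess) {
            VkMemoryBarrier2 mem{ VK_STRUCTURE_TYPE_MEMORY_BARRIER_2 };
            mem.srcStageMask = srcStage; mem.srcAccessMask = srcAccess;
            mem.dstStageMask = dstStage; mem.dstAccessMask = dstAccess;
            VkDependencyInfo dep{ VK_STRUCTURE_TYPE_DEPENDENCY_INFO };
            dep.memoryBarrierCount = 1;
            dep.pMemoryBarriers    = &mem;
            vkCmdPipelineBarrier2(cmd, &dep);
        };

        // 1) CPU -> GPU: copy the input from the staging buffer to the device buffer.
        VkBufferCopy region{ 0, 0, size };
        vkCmdCopyBuffer(cmd, stagingUpload, deviceBuffer, 1, &region);

        // The upload must finish before the first kernel reads the data.
        barrier(VK_PIPELINE_STAGE_2_TRANSFER_BIT,       VK_ACCESS_2_TRANSFER_WRITE_BIT,
                VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT, VK_ACCESS_2_SHADER_READ_BIT);

        // 2) The two compute kernels, with a barrier between them.
        vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, kernelA);
        vkCmdDispatch(cmd, groupsA, 1, 1);

        barrier(VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT, VK_ACCESS_2_SHADER_WRITE_BIT,
                VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT, VK_ACCESS_2_SHADER_READ_BIT);

        vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, kernelB);
        vkCmdDispatch(cmd, groupsB, 1, 1);

        // The second kernel must finish before the readback copy.
        barrier(VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT, VK_ACCESS_2_SHADER_WRITE_BIT,
                VK_PIPELINE_STAGE_2_TRANSFER_BIT,       VK_ACCESS_2_TRANSFER_READ_BIT);

        // 3) GPU -> CPU: copy the result back to a host-visible buffer.
        vkCmdCopyBuffer(cmd, deviceBuffer, stagingReadback, 1, &region);
    }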

[EDIT]

Just to clarify: it doesn’t mean the problem was solved completely.

While the Vulkan variant is much faster than it originally was, it is still slower than the CUDA variant. I am still looking for places where I can speed it up and will post if I have more observations.

Thanks!