CUDA vs Vulkan - performance issue (possibly __syncwarp related)

I am currently porting a CUDA algorithm to Vulkan.

The port is complete from a functional point of view - the Vulkan version produces exactly the same output as the CUDA version, given the same input.

The problem is performance - Vulkan is about 2-4 times slower than CUDA, depending on the NVIDIA GPU I am running the tests on.

I am aware that there may be many reasons why Vulkan is slower, but first I would like to focus on a single symptom that may point to the root cause of the problem.

The point is that the CUDA kernels used by the algorithm require __syncwarp calls in certain places - if I comment these calls out, the algorithm stops working and hangs.
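To illustrate the kind of dependency I mean, here is a made-up minimal example (not one of my actual kernels) of a pattern that needs __syncwarp:

    // Made-up minimal example, assuming a block of exactly 32 threads (one warp).
    // Since Volta, the threads of a warp can diverge independently, so
    // __syncwarp() is needed both to reconverge the lanes and to make the
    // shared-memory writes visible across the whole warp.
    __global__ void warpExchange(int* out)
    {
        __shared__ int slot[32];
        const unsigned lane = threadIdx.x % 32;

        slot[lane] = static_cast<int>(lane) * 2;  // every lane publishes a value
        __syncwarp();                             // without this, lane 0 may read stale data

        if (lane == 0)
        {
            int sum = 0;
            for (int i = 0; i < 32; ++i)
                sum += slot[i];
            out[blockIdx.x] = sum;
        }
    }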

I rewrote the CUDA kernels in HLSL and use dxc to compile the HLSL to SPIR-V so Vulkan can consume it. But I haven't included the __syncwarp counterparts in the HLSL code yet (because I haven't figured out how to do that yet).

BUT this doesn't prevent the Vulkan version from working properly! It looks like the driver runs Vulkan shaders in a mode that makes those syncs unnecessary. If that is indeed the case, it seems obvious that such a mode would also hurt performance.

Any idea how to make CUDA and Vulkan behave equivalently with regard to syncing?

Hi there @zoku, welcome back to the NVIDIA developer forums.

Sorry, I can’t answer your question regarding syncing, but I am curious as to the purpose of this?

CUDA and Vulkan are two completely different things.

CUDA is a parallel computing platform and programming model.

Vulkan is a new generation graphics and compute API that provides high-efficiency, cross-platform access to modern GPUs.

CUDA is not an API, and trying to “port” something from CUDA to Vulkan will only work for a very small number of use cases. And I am not surprised that going through HLSL/SPIR-V instead of executing CUDA kernels directly adds a performance hit.

You should rather use CUDA interop within Vulkan if you want to do complex parallel GPU computations without performance loss while still working with Vulkan.

Thank you for your response!

The point is that Vulkan is also a parallel computing platform. Sure, there may be CUDA features that are not available in Vulkan, but my CUDA algorithm is relatively simple and I think all the features it uses have a Vulkan counterpart.

The SPIR-V to native kernel code compilation is done only once (at the start of the application). After that, there should be no performance difference between native CUDA kernels and kernels compiled from SPIR-V, unless the driver intentionally skips some optimizations in the Vulkan path.
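Just to be explicit about what I mean by the one-time step - it is the usual compute pipeline creation, roughly like this (simplified sketch with placeholder variables, no error handling):

    #include <vulkan/vulkan.h>

    // Simplified sketch: the SPIR-V blob produced by dxc is handed to the
    // driver once, and the SPIR-V -> native ISA compilation happens at
    // pipeline creation. Later vkCmdDispatch calls only reuse the
    // already-compiled pipeline.
    VkPipeline createComputePipelineOnce(VkDevice device, VkPipelineLayout layout,
                                         const uint32_t* spirvWords, size_t spirvSizeInBytes)
    {
        VkShaderModuleCreateInfo moduleInfo{ VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO };
        moduleInfo.codeSize = spirvSizeInBytes;   // size of the dxc output in bytes
        moduleInfo.pCode    = spirvWords;         // pointer to the SPIR-V words
        VkShaderModule shaderModule = VK_NULL_HANDLE;
        vkCreateShaderModule(device, &moduleInfo, nullptr, &shaderModule);

        VkComputePipelineCreateInfo pipelineInfo{ VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO };
        pipelineInfo.stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
        pipelineInfo.stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
        pipelineInfo.stage.module = shaderModule;
        pipelineInfo.stage.pName  = "main";       // entry point used when compiling with dxc
        pipelineInfo.layout       = layout;       // pipeline layout created elsewhere

        VkPipeline pipeline = VK_NULL_HANDLE;
        vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &pipelineInfo, nullptr, &pipeline);
        return pipeline;
    }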

Hi @zoku,

Thank you for bringing this up.

While it is true that you can do parallel computing in Vulkan, the Vulkan API has different requirements than what CUDA documents. The synchronization and memory models are very different between the two, and even porting a simple algorithm from one to the other can lead to wildly different compiled code being fed to the actual GPU.

I haven't included the __syncwarp counterparts in the HLSL code yet (because I haven't figured out how to do that yet).

There doesn’t seem to be an HLSL equivalent to __syncwarp. However, HLSL allows you to inline SPIR-V, so you could inline a barrier instruction at the subgroup scope, which should achieve the same thing.

Regarding your original question, it is impossible to say whether what you are experiencing is expected or not without more details. The Vulkan and CUDA environments are just too different. It would be even more helpful if you could provide a minimal application which exhibits the problem.

Thank you all for your responses.

I was able to significantly improve the performance of the Vulkan variant - it is still not as fast as CUDA, but it is much faster than it originally was. Let me explain.

My algorithm is based on a number of work segments. Every segment looks like this:

  • send data CPU → GPU

  • run two compute kernels

  • read data GPU → CPU

I assumed that I would get a performance benefit by running both transfers on a separate transfer queue (a queue with only the transfer bit set) and the compute work on a compute queue.

So I used both queues to execute a segment as described above. I synchronized the two queues using timeline semaphores (GPU-side syncing), with a single CPU wait at the very end of the segment (also on a timeline semaphore).

Using two queues of course means more submit operations, and I am aware that submits can be expensive. I assumed that wouldn't be a problem, since I created a separate thread for each queue to perform the submits asynchronously to the main thread.
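Roughly, the per-segment submission looked like this (a heavily simplified sketch with placeholder names, written in the Vulkan 1.3 / synchronization2 style; not my actual code):

    #include <vulkan/vulkan.h>

    // Heavily simplified sketch of the two-queue variant. One timeline semaphore
    // orders the three submits:
    //   transfer queue : upload                        -> signal base+1
    //   compute queue  : wait base+1, two dispatches   -> signal base+2
    //   transfer queue : wait base+2, readback         -> signal base+3
    // The CPU then waits once, on base+3. (In the real implementation the
    // submits happen on per-queue worker threads.)
    void submitSegmentTwoQueues(VkDevice device, VkQueue transferQueue, VkQueue computeQueue,
                                VkCommandBuffer uploadCmd, VkCommandBuffer computeCmd,
                                VkCommandBuffer readbackCmd, VkSemaphore timelineSemaphore,
                                uint64_t base)
    {
        auto semInfo = [&](uint64_t value, VkPipelineStageFlags2 stage) {
            VkSemaphoreSubmitInfo info{ VK_STRUCTURE_TYPE_SEMAPHORE_SUBMIT_INFO };
            info.semaphore = timelineSemaphore;  // created with VK_SEMAPHORE_TYPE_TIMELINE
            info.value     = value;
            info.stageMask = stage;
            return info;
        };
        auto cmdInfo = [](VkCommandBuffer cb) {
            VkCommandBufferSubmitInfo info{ VK_STRUCTURE_TYPE_COMMAND_BUFFER_SUBMIT_INFO };
            info.commandBuffer = cb;
            return info;
        };

        // 1) Upload on the transfer queue.
        VkSemaphoreSubmitInfo uploadSignal = semInfo(base + 1, VK_PIPELINE_STAGE_2_TRANSFER_BIT);
        VkCommandBufferSubmitInfo uploadCb = cmdInfo(uploadCmd);
        VkSubmitInfo2 uploadSubmit{ VK_STRUCTURE_TYPE_SUBMIT_INFO_2 };
        uploadSubmit.commandBufferInfoCount   = 1; uploadSubmit.pCommandBufferInfos   = &uploadCb;
        uploadSubmit.signalSemaphoreInfoCount = 1; uploadSubmit.pSignalSemaphoreInfos = &uploadSignal;
        vkQueueSubmit2(transferQueue, 1, &uploadSubmit, VK_NULL_HANDLE);

        // 2) Both compute dispatches on the compute queue, ordered after the upload.
        VkSemaphoreSubmitInfo computeWait   = semInfo(base + 1, VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT);
        VkSemaphoreSubmitInfo computeSignal = semInfo(base + 2, VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT);
        VkCommandBufferSubmitInfo computeCb = cmdInfo(computeCmd);  // contains both dispatches
        VkSubmitInfo2 computeSubmit{ VK_STRUCTURE_TYPE_SUBMIT_INFO_2 };
        computeSubmit.waitSemaphoreInfoCount   = 1; computeSubmit.pWaitSemaphoreInfos   = &computeWait;
        computeSubmit.commandBufferInfoCount   = 1; computeSubmit.pCommandBufferInfos   = &computeCb;
        computeSubmit.signalSemaphoreInfoCount = 1; computeSubmit.pSignalSemaphoreInfos = &computeSignal;
        vkQueueSubmit2(computeQueue, 1, &computeSubmit, VK_NULL_HANDLE);

        // 3) Readback on the transfer queue, ordered after the compute work.
        VkSemaphoreSubmitInfo readbackWait   = semInfo(base + 2, VK_PIPELINE_STAGE_2_TRANSFER_BIT);
        VkSemaphoreSubmitInfo readbackSignal = semInfo(base + 3, VK_PIPELINE_STAGE_2_TRANSFER_BIT);
        VkCommandBufferSubmitInfo readbackCb = cmdInfo(readbackCmd);
        VkSubmitInfo2 readbackSubmit{ VK_STRUCTURE_TYPE_SUBMIT_INFO_2 };
        readbackSubmit.waitSemaphoreInfoCount   = 1; readbackSubmit.pWaitSemaphoreInfos   = &readbackWait;
        readbackSubmit.commandBufferInfoCount   = 1; readbackSubmit.pCommandBufferInfos   = &readbackCb;
        readbackSubmit.signalSemaphoreInfoCount = 1; readbackSubmit.pSignalSemaphoreInfos = &readbackSignal;
        vkQueueSubmit2(transferQueue, 1, &readbackSubmit, VK_NULL_HANDLE);

        // 4) Single CPU wait at the end of the segment.
        uint64_t waitValue = base + 3;
        VkSemaphoreWaitInfo waitInfo{ VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO };
        waitInfo.semaphoreCount = 1;
        waitInfo.pSemaphores    = &timelineSemaphore;
        waitInfo.pValues        = &waitValue;
        vkWaitSemaphores(device, &waitInfo, UINT64_MAX);
    }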

That's the theory. And here is how it looked when I inspected the execution of a single segment in Nsight Graphics:

Green - memory transfers
Orange/yellow - compute execution
Red - waiting on semaphores

If I am interpreting the graph correctly, it looks like waiting on the semaphores alone introduced significant overhead. Remember that this is 100% GPU-side sync - no CPU is involved after the workload has been submitted to the GPU. I was shocked to see how large the overhead was ;)

I reworked the implementation to use a single queue and a single submit per work segment. There is no need for cross-queue syncing now - only pipeline barriers within a single queue. It is much faster now.
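The reworked segment is recorded into one command buffer, roughly like this (again a simplified sketch with placeholder names; descriptor set binding and error handling omitted):

    #include <vulkan/vulkan.h>

    // Simplified sketch of the single-queue segment. Everything is recorded
    // into one command buffer, and ordering inside the queue is expressed with
    // global memory barriers instead of semaphores.
    void recordSegment(VkCommandBuffer cmd,
                       VkBuffer stagingUpload, VkBuffer deviceBuffer, VkBuffer stagingReadback,
                       VkDeviceSize size, VkPipeline kernelA, VkPipeline kernelB,
                       uint32_t groupsA, uint32_t groupsB)
    {
        auto barrier = [&](VkPipelineStageFlags2 srcStage, VkAccessFlags2 srcAccess,
                           VkPipelineStageFlags2 dstStage, VkAccessFlags2 dstAccess) {
            VkMemoryBarrier2 mem{ VK_STRUCTURE_TYPE_MEMORY_BARRIER_2 };
            mem.srcStageMask = srcStage; mem.srcAccessMask = srcAccess;
            mem.dstStageMask = dstStage; mem.dstAccessMask = dstAccess;
            VkDependencyInfo dep{ VK_STRUCTURE_TYPE_DEPENDENCY_INFO };
            dep.memoryBarrierCount = 1;
            dep.pMemoryBarriers    = &mem;
            vkCmdPipelineBarrier2(cmd, &dep);
        };

        // 1) CPU -> GPU: copy the input from the staging buffer to the device buffer.
        VkBufferCopy region{ 0, 0, size };
        vkCmdCopyBuffer(cmd, stagingUpload, deviceBuffer, 1, &region);

        // The upload must finish before the first kernel reads the data.
        barrier(VK_PIPELINE_STAGE_2_TRANSFER_BIT,       VK_ACCESS_2_TRANSFER_WRITE_BIT,
                VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT, VK_ACCESS_2_SHADER_READ_BIT);

        // 2) The two compute kernels, with a barrier between them.
        vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, kernelA);
        vkCmdDispatch(cmd, groupsA, 1, 1);

        barrier(VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT, VK_ACCESS_2_SHADER_WRITE_BIT,
                VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT, VK_ACCESS_2_SHADER_READ_BIT);

        vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, kernelB);
        vkCmdDispatch(cmd, groupsB, 1, 1);

        // The second kernel must finish before the readback copy.
        barrier(VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT, VK_ACCESS_2_SHADER_WRITE_BIT,
                VK_PIPELINE_STAGE_2_TRANSFER_BIT,       VK_ACCESS_2_TRANSFER_READ_BIT);

        // 3) GPU -> CPU: copy the result back to a host-visible buffer.
        vkCmdCopyBuffer(cmd, deviceBuffer, stagingReadback, 1, &region);
    }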

[EDIT]

Just to clarify: it doesn’t mean the problem was solved completely.

While the Vulkan variant is much faster than it originally was, it is still slower than the CUDA variant. I am still looking for places where I can speed it up and will post if I have more observations.

Thanks!