Multithreading increases API call overhead?

Hello everyone.
I am running a multithreaded application in which multiple host threads each launch multiple kernels. Each thread launches all of its kernels into its own CUDA stream, so overlapping kernels is not the issue here. The issue is the CUDA API calls themselves: I have noticed that if multiple threads try to launch kernels asynchronously at around the same time, the API call time increases.
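For reference, here is a minimal sketch of the setup. The kernel, thread count, and sizes are placeholders rather than my actual code, and error checking is omitted for brevity:

#include <cuda_runtime.h>
#include <thread>
#include <vector>

// placeholder kernel; the real kernels do more work
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int NUM_THREADS = 4;          // illustrative values
    const int KERNELS_PER_THREAD = 100;
    const int N = 1 << 20;

    std::vector<std::thread> workers;
    for (int t = 0; t < NUM_THREADS; ++t) {
        workers.emplace_back([=] {
            float *d;
            cudaMalloc(&d, N * sizeof(float));
            cudaStream_t stream;
            cudaStreamCreate(&stream);  // each host thread owns its stream
            for (int k = 0; k < KERNELS_PER_THREAD; ++k) {
                // asynchronous launch; this is the API call whose
                // duration grows when other threads launch concurrently
                busyKernel<<<(N + 255) / 256, 256, 0, stream>>>(d, N);
            }
            cudaStreamSynchronize(stream);
            cudaStreamDestroy(stream);
            cudaFree(d);
        });
    }
    for (auto &w : workers) w.join();
    return 0;
}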
Using nvvp I notice the following pattern:
t1 [------cudaLaunchKernel--------]
t2 [------------------------------------cudaLaunchKernel---------------------------------]
t3 [---------------------cudaLaunchKernel----------------------]

It looks like the API calls are implicitly serialized; that is, t3 is not able to launch its kernel until t1 has finished launching its kernel. Is this a correct assumption? And if it is, is there a way around it?

P.S.: This is my first post on this forum, so if there are any conventions or rules I did not follow, please let me know.

Yes, I think that is correct. The CUDA runtime API (perhaps it would be better to think of it as the "operating system for the GPU") has various functionality that cannot simply be multithreaded. Certain requests require the runtime to acquire (and release) locks under the hood, for example. When these activities are performed from multiple threads concurrently, they can produce a visible increase in apparent call time. You can find other questions on this forum discussing similar issues.

I’m not aware of any way to entirely circumvent the behavior.


I see. Thanks for clarifying.