cudaLaunchKernel Hangs When Interleaving Multiple Empty Kernels and cudaLaunchHostFunc

Robert_Crovella · June 25, 2022, 2:17pm

You’ve created a deadlock here. (Stating the obvious).

A callback is not supposed to implicitly or explicitly depend on CUDA API activity.

So what is going on here?

Asynchronous activity (work) issued to the GPU goes into a queue. As the GPU becomes able to process the work, it is unloaded from the queue and dispatched to the GPU. As you’ve already pointed out, you are issuing alternating kernels and host functions into the same stream.

Each of your callbacks has the property that it is waiting on a semaphore (or whatever you want to call it) that will only be signalled after all work issuance is complete.

So first you issue a kernel. Presumably at some point later that runs to completion. Then you issue a callback, but this callback will not complete until all your work issuance is complete, and you get around to signalling the callbacks. So the first issued callback starts but does not complete. Therefore all subsequent work issued (into that stream) will sit in the queue, due to CUDA stream semantics, waiting until it can be issued.

Eventually you run out of queue depth. The queues provided to support asynchronous work issuance are not of infinite depth. When the queue becomes full, a kernel launch changes from an asynchronous, non-blocking call, to a synchronous blocking call, waiting for a queue slot to open up, before it can put the new kernel launch in the queue and return control to the host thread. And as a result of the full queue and this change in behavior of the kernel launch mechanism, your code “hangs” at the kernel launch point.

This is all expected behavior.

Don’t do that.

If you issue a host function into a stream, then the principal target of this functionality is that you are saying “this host function may begin when the previous CUDA processing is complete”. That should work fine.

However your code seems to be saying “this host function may begin when the previous CUDA processing is complete and some additional work has been issued to the GPU, which may or may not be complete.”

It’s hard for me to imagine what the benefit of waiting until some work has been issued but “may or may not be complete”, and certainly typical usage of host functions isn’t designed for that case. Especially since the host function works in a separate CPU thread anyway. It’s not as if launching a host function is going to delay the issuance of additional work, and it’s not as if launching the additional work is going to delay the processing of the host function when it is ready to go.

So I don’t have any suggestions for “workarounds” based on what you have shown here. I don’t understand the program logic, and it looks like a test case designed to provoke this kind of behavior.

Don’t do that.

Topic		Replies	Views
Does cudaLaunchHostFunc block work added to all streams? CUDA Programming and Performance	19	1661	October 12, 2021
cudaLaunchKernel（） hang Linux	0	70	October 23, 2024
cudaLaunchKernel hang in function futex_abstimed_wait CUDA Programming and Performance	6	799	November 6, 2023
cudaLaunchHostFunc API example CUDA Programming and Performance	31	6763	February 8, 2025
culaunchHostFunc overhead latency usage + CPU->GPU signaling CUDA Programming and Performance	6	226	April 1, 2025
cudaLaunchHostFunc blocking work on Linux CUDA Programming and Performance cuda	2	743	September 22, 2022
Synchronization hangs sporadically after kernel launch CUDA Programming and Performance	23	8080	August 20, 2015
Deadlock on cuda kernel launch CUDA Programming and Performance cuda	12	240	July 21, 2025
Inexpiable CUDA hang (NOT WDM timeout!) CUDA Programming and Performance	2	1538	June 5, 2014
Device hangs / freezes / crashes under specific circumstances CUDA Programming and Performance cuda , kernel	5	1166	September 1, 2024

cudaLaunchKernel Hangs When Interleaving Multiple Empty Kernels and cudaLaunchHostFunc

Related topics