cudaLaunchKernel Hangs When Interleaving Multiple Empty Kernels and cudaLaunchHostFunc

I ran into this weird hang where if I do the following in a loop on the same stream, it will run into a hang:

  1. an empty kernel on this stream
  2. cudaLaunchKernel on this stream that waits for an host event to be signaled

Eventually I will signal all the host events to let all the cudaLaunchKernel to be unblocked. However the code doesn’t even get to that stage, but instead hangs at some noop kernel launch when we do a lot of iterations(gdb shows the hang is inside cudaLaunchKernel itself)

Below is a simple program to reproduce this problem. For me when num_events is 50 the program still runs fine and completes. But when I try something like 60 or more it will hang.

This is run on CUDA11.3, A100 GPU

#include <iostream>
#include <vector>
#include <mutex>
#include <condition_variable>

#include "cuda_runtime.h"
#include "errorcheck.h"
// empty kernel
__global__ void NoOpKernel() {}

// for blocking stream to wait for host signal
class Event {
  std::mutex mtx_condition_;
  std::condition_variable condition_;
  bool signalled = false;

  void Signal() {
      std::lock_guard<decltype(mtx_condition_)> lock(mtx_condition_);
      signalled = true;

  void Wait() {
    std::unique_lock<decltype(mtx_condition_)> lock(mtx_condition_);
    while (!signalled) {

void CUDART_CB block_op_host_fn(void* arg) {
  Event* evt = (Event*)arg;

int main() {
  cudaStream_t stream;

  int num_events = 60; // 50 is okay, 60 will hang
  std::vector<std::shared_ptr<Event>> event_vec;

  for (int i = 0; i < num_events; i++) {
    std::cout << "Queuing NoOp " << i << std::endl;
    NoOpKernel<<<1, 128, 0, stream>>>(); // HERE : is where it hangs
    std::cout << "Queued NoOp " << i << std::endl;

    cudaLaunchHostFunc(stream, block_op_host_fn, event_vec.back().get());

    std::cout << "Queued block_op " << i << std::endl;

  for (int i = 0; i < num_events; i++) {

  // clean up
  return 0;

You’ve created a deadlock here. (Stating the obvious).

A callback is not supposed to implicitly or explicitly depend on CUDA API activity.

So what is going on here?

Asynchronous activity (work) issued to the GPU goes into a queue. As the GPU becomes able to process the work, it is unloaded from the queue and dispatched to the GPU. As you’ve already pointed out, you are issuing alternating kernels and host functions into the same stream.

Each of your callbacks has the property that it is waiting on a semaphore (or whatever you want to call it) that will only be signalled after all work issuance is complete.

So first you issue a kernel. Presumably at some point later that runs to completion. Then you issue a callback, but this callback will not complete until all your work issuance is complete, and you get around to signalling the callbacks. So the first issued callback starts but does not complete. Therefore all subsequent work issued (into that stream) will sit in the queue, due to CUDA stream semantics, waiting until it can be issued.

Eventually you run out of queue depth. The queues provided to support asynchronous work issuance are not of infinite depth. When the queue becomes full, a kernel launch changes from an asynchronous, non-blocking call, to a synchronous blocking call, waiting for a queue slot to open up, before it can put the new kernel launch in the queue and return control to the host thread. And as a result of the full queue and this change in behavior of the kernel launch mechanism, your code “hangs” at the kernel launch point.

This is all expected behavior.

Don’t do that.

If you issue a host function into a stream, then the principal target of this functionality is that you are saying “this host function may begin when the previous CUDA processing is complete”. That should work fine.

However your code seems to be saying “this host function may begin when the previous CUDA processing is complete and some additional work has been issued to the GPU, which may or may not be complete.”

It’s hard for me to imagine what the benefit of waiting until some work has been issued but “may or may not be complete”, and certainly typical usage of host functions isn’t designed for that case. Especially since the host function works in a separate CPU thread anyway. It’s not as if launching a host function is going to delay the issuance of additional work, and it’s not as if launching the additional work is going to delay the processing of the host function when it is ready to go.

So I don’t have any suggestions for “workarounds” based on what you have shown here. I don’t understand the program logic, and it looks like a test case designed to provoke this kind of behavior.

Don’t do that.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.