Do asynchronous activities issued to different streams share the same queue?

Thanks for the previous answer! Following up on that post, we noticed that cudaEventRecord exhibits the same blocking behavior as cudaLaunchKernel.

Thus, we are hoping for confirmation of our understanding:
Asynchronous activities share the same queue (globally), even if those activities are being issued to different CUDA streams. Is this correct?

We arrived at this assumption via the experiment below (CUDA 11.3). The result shows that cudaLaunchKernel also blocks, even though the launches are issued to a different stream and hosted by a different thread.

#include <iostream>
#include <vector>
#include <memory>
#include <mutex>
#include <condition_variable>
#include <thread>

#include "cuda_runtime.h"
#include "errorcheck.h"


// empty kernel
__global__ void NoOpKernel() {}

// for blocking stream to wait for host signal
class Event {
 private:
  std::mutex mtx_condition_;
  std::condition_variable condition_;
  bool signalled = false;

 public:
  void Signal() {
    {
      std::lock_guard<decltype(mtx_condition_)> lock(mtx_condition_);
      signalled = true;
    }
    condition_.notify_all();
  }

  void Wait() {
    std::unique_lock<decltype(mtx_condition_)> lock(mtx_condition_);
    while (!signalled) {
      condition_.wait(lock);
    }
  }
};

// host callback that blocks the stream until the host signals the Event
void CUDART_CB block_op_host_fn(void* arg) {
  Event* evt = static_cast<Event*>(arg);
  evt->Wait();
}


static cudaEvent_t end_event;

// keep launching empty kernels
static void launchKernels(cudaStream_t stream) {
  for (int i = 0; i < 1000; i++) {
    NoOpKernel<<<1, 128, 0, stream>>>();
    std::cout << "Done KernelLaunch #" << i << std::endl;
  }
}

static void blocker(cudaStream_t stream) {
  int num_events = 150;  // hangs if THIS >= 57
  std::vector<std::shared_ptr<Event>> event_vec;

  for (int i = 0; i < num_events; i++) {
    event_vec.push_back(std::make_shared<Event>());
    CUDACHECK(cudaLaunchHostFunc(stream, block_op_host_fn, event_vec.back().get()));

    std::cout << "Before recording #" << i << std::endl;
    // note: this overwrites end_event each iteration and leaks the previous
    // handle; harmless for a short repro, but a real app should destroy it
    CUDACHECK(cudaEventCreate(&end_event));
    CUDACHECK(cudaEventRecord(end_event, stream));
    std::cout << "After recording  #" << i << std::endl;
  }

  for (int i = 0; i < num_events; i++) {
    event_vec[i]->Signal();
  }
}


int main() {
  cudaStream_t stream1, stream2;
  CUDACHECK(cudaStreamCreate(&stream1));
  CUDACHECK(cudaStreamCreate(&stream2));

  auto t1 = std::thread(blocker, stream1);
  auto t2 = std::thread(launchKernels, stream2);

  t1.join();
  t2.join();

  CUDACHECK(cudaDeviceSynchronize());
  return 0;
}

In our case, we do need to prevent anything from being issued to a certain stream until some other work has completed. But we cannot explicitly wait for that work before we have to issue further asynchronous activities or kernels; one reason is that the work might still be at an unknown stage at the point when we have to issue other things.
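For completeness: when the awaited work is itself GPU work, the usual way to express this is a device-side dependency via cudaEventRecord plus cudaStreamWaitEvent, which never parks a host callback in a work queue. Below is a minimal sketch of that pattern (the producer/consumer names are illustrative); it does not cover our case, where the completion signal comes from the host.

#include "cuda_runtime.h"

__global__ void ProducerKernel() {}
__global__ void ConsumerKernel() {}

int main() {
  cudaStream_t producer, consumer;
  cudaStreamCreate(&producer);
  cudaStreamCreate(&consumer);

  cudaEvent_t done;
  // timing is not needed, so create a cheaper synchronization-only event
  cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

  ProducerKernel<<<1, 128, 0, producer>>>();  // the "some works"
  cudaEventRecord(done, producer);

  // everything issued to `consumer` from here on waits for `done` on the
  // device; the host thread itself never blocks
  cudaStreamWaitEvent(consumer, done, 0);
  ConsumerKernel<<<1, 128, 0, consumer>>>();

  cudaDeviceSynchronize();
  cudaEventDestroy(done);
  cudaStreamDestroy(producer);
  cudaStreamDestroy(consumer);
  return 0;
}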

Asynchronous activities share the same queue (globally), even if those activities are being issued to different CUDA streams. Is this correct?

I don’t think the asynchronous queue structure is documented anywhere by NVIDIA. Even if I tried to answer that, the behavior could change in the next (or a previous) CUDA version, or vary based on other factors.

It is possible to demonstrate that the “queue depth” seems to vary based on exactly what you are doing.
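One way to probe it, as a rough sketch: stall a stream with a blocking host callback, then time each subsequent launch on that stream; the index at which the launch call itself stalls is the apparent queue depth. The 3-second watchdog and 1-second threshold below are arbitrary choices of mine, and the depth you observe is an implementation detail:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

#include "cuda_runtime.h"

static std::atomic<bool> release{false};

// host callback that stalls the stream until the watchdog fires
static void CUDART_CB Block(void*) {
  while (!release.load()) {
    std::this_thread::yield();
  }
}

__global__ void NoOp() {}

int main() {
  cudaStream_t s;
  cudaStreamCreate(&s);
  cudaLaunchHostFunc(s, Block, nullptr);  // nothing behind this can run yet

  // watchdog: unblock the stream after 3 s so a launch call that stalled
  // on a full queue can return and we can see which one it was
  std::thread watchdog([] {
    std::this_thread::sleep_for(std::chrono::seconds(3));
    release.store(true);
  });

  for (int i = 0; i < 200; i++) {
    auto t0 = std::chrono::steady_clock::now();
    NoOp<<<1, 1, 0, s>>>();  // returns immediately while the queue has room
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    if (ms > 1000) {
      std::printf("launch #%d blocked for %lld ms\n", i, (long long)ms);
      break;
    }
  }

  watchdog.join();
  cudaDeviceSynchronize();
  return 0;
}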

If the application creates more streams than connections (work queues) then multiple streams will alias to the same work queue… The environment variable CUDA_DEVICE_MAX_CONNECTIONS can be used to set the preferred number of compute and copy engine work queues. The default value is 8. The range is 1-32.
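As far as I know, CUDA_DEVICE_MAX_CONNECTIONS is read when the CUDA context is created, so it has to be set before the first runtime call, either exported in the shell or from the program itself. A minimal sketch of the latter:

#include <cstdlib>

#include "cuda_runtime.h"

int main() {
  // must happen before any CUDA runtime call creates the context;
  // setenv is POSIX, Windows would use _putenv_s instead
  setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", /*overwrite=*/1);

  cudaFree(0);  // force context creation with the setting above

  // with 32 connections, up to 32 streams can map to distinct work queues;
  // create more than that and some will alias to the same queue
  cudaStream_t streams[32];
  for (auto& s : streams) cudaStreamCreate(&s);
  for (auto& s : streams) cudaStreamDestroy(s);
  return 0;
}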
