cuLaunchHostFunc Questions

Hi NVIDIA Team,

I have been experimenting with cuLaunchHostFunc in CUDA and have encountered some behavior that I would like to clarify. Along with my questions, I have attached a snippet of my code and profiling results from Nsight Systems. In the following figures, I run the code with 4 streams.

Here are my questions:

  1. Dependency Concerns with Host Function Execution in Figure 1:
    I understand that all host functions launched using cuLaunchHostFunc are executed by a single CPU thread. However, I’m concerned about potential false dependencies. Can you confirm whether this setup could lead to such issues? It seems similar to the scenario of having multiple streams but only one connection in CUDA.

  2. Semaphores and Writes on the CPU Thread in Figure 2:
  • In my profiling results, I noticed calls to sem_wait and write on the CPU thread. However, not every invocation of the host function is preceded by a sem_wait, and there is a write both before and after the host function. Could you please explain the purpose of these operations, especially the write calls?
  3. Timeline Separation of sleep_kernel in Figure 1:
  • I observed that the first sleep_kernel invocation is split across different timelines in the profiling results. Is this an artifact of the Nsight Systems run, or is there a specific reason for this behavior?

I appreciate any insights or explanations you can provide regarding these questions. Thank you for your time and assistance.

#include <iostream>
#include <cstdlib>
#include <vector>
#include <chrono>
#include <thread>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>

void myHostFunction(void *userData) {
    // Simulate ~600 ms of host-side work in the callback
    std::this_thread::sleep_for(std::chrono::milliseconds(600));
}

const unsigned long long MSECOND = 1'000'000ULL;  // 1 ms in nanoseconds
__global__ void sleep_kernel(){
    // Sleep ~1 s (1000 x 1 ms)
    for (int i = 0; i < 1000; i++) __nanosleep(MSECOND);
}

int main(int argc, char *argv[]) {
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <num_streams>" << std::endl;
        return 1;
    }
    const int NUM_CONNECTION = std::atoi(argv[1]);
    std::cout << "Creating number of streams: " << NUM_CONNECTION << std::endl;
    std::vector<cudaStream_t> streams(NUM_CONNECTION);
    for (auto& s : streams) 
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    std::cout << "Kernel-ing" << std::endl;
    const auto CPU_START = std::chrono::steady_clock::now(); 
    for (auto& s : streams) {
       cuLaunchHostFunc(s, myHostFunction, NULL);
       sleep_kernel<<<1,1,0,s>>>();
       cuLaunchHostFunc(s, myHostFunction, NULL);
       sleep_kernel<<<1,1,0,s>>>();
    } 
    std::cout << "Launching DONE." << std::endl;
    for (auto& s : streams) 
        cudaStreamSynchronize(s);
    std::cout << "Synchronizing DONE" << std::endl;
    const auto CPU_END = std::chrono::steady_clock::now(); 
    auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(CPU_END - CPU_START);
    std::cout << "CPU SIDE ELAPSED TIME: " << elapsed_time.count() / 1000.0 << " (s)" << std::endl;
    for (auto& s : streams) 
        cudaStreamDestroy(s);
    return 0;
}

Yes, it can lead to such issues. If your intention was that two separate host funcs would run asynchronously with respect to each other, the host func mechanism does not guarantee that. One host func may have to wait for another host func to complete before it begins.

Thanks.
Is the write (possibly with a paired read) a CUDA implementation detail, then?
I am asking because I could not find much information on cuLaunchHostFunc.