cuLaunchHostFunc latency overhead + CPU->GPU signaling

I am trying to understand the latency overhead of using cuLaunchHostFunc.
I have a for loop that launches work in the main stream, stream 0. Before every iteration, I want some CPU code to run and finish. Therefore, in another stream (stream 1), I use cuLaunchHostFunc to send an event to a CPU service (the host function itself is quick), and then launch a kernel that waits for a signal from the CPU. After that kernel, I record an event that stream 0 waits on.
Pseudo code:
for: 
  stream 1:
    - cuLaunchHostFunc
    - kernel
    - signal event
  stream 0:
    - wait event
    - kernel

Actual code:

    for (int i = 0; i < NUM_ITERATIONS; i++) {
        // stream1:
        cudaLaunchHostFunc(stream1, fastCallback, &iterationIndices[i]);
        
        timeBasedSleepKernel<<<1, 1, 0, stream1>>>(sleepMicroseconds);
        
        // Record event after sleep kernel completes in Stream 1
        cudaEventRecord(events[i], stream1);

        // stream0:        
        // Make Stream 0 wait for the event from Stream 1 before launching next computation
        cudaStreamWaitEvent(stream0, events[i], 0);
        
        // Launch computation kernel for next iteration in Stream 0
        simulateWorkKernel<<<gridSize, blockSize, 0, stream0>>>(
            d_buffer, dataSize, computeIterations);
    }
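
A minimal, self-contained sketch of the stream 1 side of this flow, assuming the CPU service signals the GPU through a flag in mapped pinned host memory. All names here (notifyService, waitForCpu, requested, hostFlag) are hypothetical and are not the fastCallback or timeBasedSleepKernel used above; the event record and the stream 0 work are left out to keep it short.

```cuda
#include <atomic>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical mailbox: highest iteration ticket requested so far.
static std::atomic<int> requested{0};

// Host function enqueued in stream 1. It only posts a request and returns;
// host functions must not call CUDA APIs.
void CUDART_CB notifyService(void *userData) {
    requested.store(*static_cast<int *>(userData), std::memory_order_release);
}

// Single-thread kernel that spins until the CPU publishes a ticket >= expected.
__global__ void waitForCpu(volatile int *flag, int expected) {
    while (*flag < expected) {
        // poll the host-mapped flag
    }
}

int main() {
    const int iterations = 4;
    int tickets[iterations];

    // Pinned, mapped host memory that the kernel can read directly.
    int *hostFlag = nullptr;
    int *devFlag = nullptr;
    cudaHostAlloc((void **)&hostFlag, sizeof(int), cudaHostAllocMapped);
    *hostFlag = 0;
    cudaHostGetDevicePointer((void **)&devFlag, hostFlag, 0);

    // Hypothetical CPU service: waits for a request, does its work,
    // then publishes the ticket so the waiting kernel can exit.
    std::thread service([&] {
        int done = 0;
        while (done < iterations) {
            int want = requested.load(std::memory_order_acquire);
            if (want > done) {
                // ... CPU work for this iteration would go here ...
                done = want;
                *(volatile int *)hostFlag = done;
            } else {
                std::this_thread::yield();
            }
        }
    });

    cudaStream_t stream1;
    cudaStreamCreate(&stream1);
    for (int i = 0; i < iterations; ++i) {
        tickets[i] = i + 1;
        cudaLaunchHostFunc(stream1, notifyService, &tickets[i]);
        waitForCpu<<<1, 1, 0, stream1>>>(devFlag, i + 1);
        // In the full flow, an event would be recorded here for stream 0.
    }
    cudaStreamSynchronize(stream1);
    service.join();

    printf("last published ticket: %d\n", *hostFlag);
    cudaStreamDestroy(stream1);
    cudaFreeHost(hostFlag);
    return 0;
}
```

Because stream 1 is in-order, the host function for iteration i+1 cannot run before the wait kernel for iteration i has exited, so the service never sees a skipped ticket in this sketch.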

The problem is that the latency of returning from the CPU callback is not consistent when using multiple streams:


We can see here 12 ms of overhead between the CPU callback finishing and stream 1 continuing, and it does not happen every time.

Another strange thing is that if I remove stream 0, the problem disappears.

  1. Can someone explain why this happens?
  2. I would appreciate any advice on how to implement the flow explained above.

Many thanks.

If you are launching a host func from each stream, those host funcs may be processed in a single extra CPU thread spun up by CUDA, and therefore may be serialized. The serialization could give rise to additional latency.
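
One way to check whether this applies on a given setup is to log which CPU thread runs each host func and when. The following is only a diagnostic sketch under assumptions: the names (CallbackLog, slowCallback) and the 5 ms sleep are made up for illustration, and the result may differ between CUDA versions and platforms.

```cuda
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

// Hypothetical per-callback record: which CPU thread ran it, and when.
struct CallbackLog {
    size_t threadHash;
    Clock::time_point begin;
    Clock::time_point end;
};

// Deliberately slow host func so that any overlap (or lack of it) is visible.
void CUDART_CB slowCallback(void *userData) {
    CallbackLog *log = static_cast<CallbackLog *>(userData);
    log->threadHash = std::hash<std::thread::id>{}(std::this_thread::get_id());
    log->begin = Clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    log->end = Clock::now();
}

int main() {
    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    CallbackLog logA{}, logB{};
    cudaLaunchHostFunc(streamA, slowCallback, &logA);
    cudaLaunchHostFunc(streamB, slowCallback, &logB);
    cudaDeviceSynchronize();

    // Same thread and non-overlapping intervals would suggest serialization.
    bool sameThread = (logA.threadHash == logB.threadHash);
    bool overlapped = (logA.begin < logB.end) && (logB.begin < logA.end);
    printf("same CPU thread: %s, callbacks overlapped: %s\n",
           sameThread ? "yes" : "no", overlapped ? "yes" : "no");

    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    return 0;
}
```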

The code in gray in your post doesn’t really align with your pseudo code, so I can’t really tell what you are doing. Therefore I don’t know if this is the issue or not.

I am sorry, I had a mess in the pseudo code… fixed it.
I am not launching CPU callbacks from many streams, only from one stream.
@Robert_Crovella