I am trying to understand the overhead latency of using `cudaLaunchHostFunc`.
I have a for loop running on the main stream (stream 0), and before every iteration some CPU code must run to completion. Therefore, in another stream, I plan to use cudaLaunchHostFunc to send an event to a CPU service (the callback itself is quick), and after that launch a kernel that waits for a signal from the CPU. After the kernel, I record an event that stream 0 waits on.
Pseudo code:
for each iteration:
  stream 1:
    - cudaLaunchHostFunc
    - kernel (waits for CPU signal)
    - record event
  stream 0:
    - wait event
    - kernel
for (int i = 0; i < NUM_ITERATIONS; i++) {
    // stream1:
    cudaLaunchHostFunc(stream1, fastCallback, &iterationIndices[i]);
    timeBasedSleepKernel<<<1, 1, 0, stream1>>>(sleepMicroseconds);
    // Record event after sleep kernel completes in stream 1
    cudaEventRecord(events[i], stream1);

    // stream0:
    // Make stream 0 wait for the event from stream 1 before launching the next computation
    cudaStreamWaitEvent(stream0, events[i], 0);
    // Launch computation kernel for the next iteration in stream 0
    simulateWorkKernel<<<gridSize, blockSize, 0, stream0>>>(
        d_buffer, dataSize, computeIterations);
}
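For reference, here is a simplified sketch of the helper pieces — the names and the pinned-flag signaling mechanism are illustrative (in my real code the callback notifies a separate CPU service, and in the measurements above I substitute `timeBasedSleepKernel` for the wait kernel so the GPU-side wait time is controlled):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Host callback launched via cudaLaunchHostFunc; must return quickly
// (the real version just enqueues a notification to the CPU service).
void CUDART_CB fastCallback(void* userData) {
    int iteration = *static_cast<int*>(userData);
    printf("callback for iteration %d\n", iteration);
}

// Kernel that busy-waits until the CPU sets *flag. The flag lives in
// pinned, mapped host memory so the CPU write is visible to the GPU.
__global__ void waitForCpuSignalKernel(volatile int* flag) {
    while (*flag == 0) { /* spin until the CPU service signals */ }
}
```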
Now the problem: the latency of returning from the CPU callback is not consistent when using multiple streams.
In the trace I can see about 12 ms of overhead from the moment the CPU callback finishes until stream 1 continues, and it does not happen on every iteration.
Another strange thing is that if I remove stream 0 entirely, the problem disappears.
- Can someone explain why this happens?
- I would appreciate any advice on how to implement the flow explained above.
Many thanks.