Can anyone provide any clues on how to use cuStreamWaitEvent to queue up async operations?
I can’t seem to hold back a kernel launch with it, whether I call it before or after cuLaunchKernel, or in combination with cuEventRecord: given a CUstream, cuLaunchKernel launches the kernel right away.
The docs are confusing:
cuStreamWaitEvent:
“The stream hStream will wait only for the completion of the most recent host call to cuEventRecord() on hEvent.”
VS
“If cuEventRecord() has not been called on hEvent, this call acts as if the record has already completed, and so is a functional no-op.”
So it will wait for the most recent cuEventRecord on that event, but if there hasn’t been one it does nothing.
The kernel launch will wait only for the first cuEventRecord, not both. Events that have never been recorded are treated as already triggered, as a way to prevent deadlock.
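To make the two quoted rules concrete, here is a toy host-side model of them. It is only a sketch of the documented behaviour, not how the driver works internally. All names (ToyEvent, ToyWait, stream_wait_event, demo_wait_binding) are made up for illustration, and it assumes records on one event complete in the order they were issued.

```cpp
#include <cassert>
#include <cstdint>

// Toy host-side model only; this is NOT how the driver is implemented.
// Assumption: records on an event complete in the order they were issued.
struct ToyEvent {
    std::uint64_t issued = 0;     // number of record() calls made by the host so far
    std::uint64_t completed = 0;  // number of those records the device has reached

    // Stands in for cuEventRecord.
    void record() { ++issued; }
    // Stands in for the device reaching the next outstanding record.
    void complete_next() { if (completed < issued) ++completed; }
};

// A wait remembers which record was the most recent when it was enqueued.
struct ToyWait {
    std::uint64_t target;  // 0 means the event had never been recorded
    bool satisfied(const ToyEvent& e) const { return e.completed >= target; }
};

// Stands in for cuStreamWaitEvent: binds to the latest record at call time.
ToyWait stream_wait_event(const ToyEvent& e) { return ToyWait{e.issued}; }

// Walks through the three cases discussed above; returns true if all hold.
bool demo_wait_binding() {
    ToyEvent ev;
    ToyWait before_any_record = stream_wait_event(ev);
    bool noop = before_any_record.satisfied(ev);   // never recorded: nothing to wait for

    ev.record();                                   // first record
    ToyWait after_first = stream_wait_event(ev);   // binds to record #1
    ev.record();                                   // second record, issued after the wait
    bool blocks = !after_first.satisfied(ev);      // record #1 has not been reached yet

    ev.complete_next();                            // device reaches record #1
    bool released = after_first.satisfied(ev);     // released without waiting for record #2
    return noop && blocks && released;
}
```

In this model a wait issued before any record is satisfied immediately, and a wait issued between two records is released by the first one alone, which matches the “first instance, not both” behaviour described above.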
cuStreamWaitEvent appears to queue a ‘wait for a specific event’ into a specific stream, and in this way allows me to create an execution list with kernel execution or memcpy dependencies (possibly with linear or tree flavours. loops?).
I’m using flags CU_EVENT_DISABLE_TIMING, CU_CTX_SCHED_BLOCKING_SYNC and CU_CTX_MAP_HOST. Are there others I should be aware of?
Anyone out there know if this approach will blow up in my face sooner (I’m on a GTX 285, compute 1.3)? (Likely later of course, that goes without question…)
The problem in your second post is that you haven’t recorded the event before waiting on it. This causes cuStreamWaitEvent to complete immediately, so it won’t block anything.
Why this is a good thing: launching work to the GPU is generally asynchronous, but it’s not guaranteed to be asynchronous. Eventually you will fill up some queue somewhere, and the driver will have to block the CPU thread until something drains a bit before it can launch more work. If we allowed you to call cuStreamWaitEvent on an event before you recorded it, you could do something like:
// Hypothetical: suppose this wait really blocked on a not-yet-recorded event
cudaStreamWaitEvent(stream, event, 0);
for (int i = 0; i < 1000000; i++) {
    kernel<<<..., stream>>>(i); // eventually you won't be able to launch any more kernels, but no kernels can run
}
kernel2<<<..., stream2>>>(0); // you'll never reach here because you're stuck in the loop
cudaEventRecord(event, stream2); // deadlock!