This is necessary to hide kernel launch latency.
For example, I want to launch kernel function K after the async CPU function C has executed.
Usually this is done by passing the kernel launch as a callback to the async function C:
void C(std::function<void()> callback);
C( []{K();} );
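For reference, here is a minimal host-only sketch of that callback pattern. It makes several assumptions that are not part of the snippet above: `K` is a stub that only logs instead of launching a real kernel, `C` is implemented with `std::async` and returns a `std::future<void>` so the caller can wait for it, and `trace` and `run_callback_pattern` are hypothetical helpers added for illustration.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical log of what happened, in order.
std::vector<std::string> trace;

// Stand-in for the kernel launch K(); a real version would launch a CUDA kernel.
void K() { trace.push_back("K launched"); }

// Assumed shape of the async CPU function C: it does its work on another
// thread and invokes the callback once that work is finished.
// Returning a future is an assumption so the caller can wait for completion.
std::future<void> C(std::function<void()> callback) {
    assert(callback && "callback must be callable");
    return std::async(std::launch::async, [cb = std::move(callback)] {
        trace.push_back("C finished");  // the CPU work would happen here
        cb();                           // K is only launched after C is done
    });
}

// Runs the pattern once and returns the number of recorded steps.
int run_callback_pattern() {
    trace.clear();
    std::future<void> done = C([] { K(); });
    done.wait();  // wait until C and the callback have both run
    return static_cast<int>(trace.size());
}
```

In this arrangement the launch of K cannot start until C has finished, so the launch latency is paid after C rather than overlapped with it.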
However, I would like to do it like this:
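cudaEvent_t event;
cudaEventCreate(&event);
cudaEventRecord(event); // proposed semantics: record from the CPU, not tied to a stream; the event stays pending.
// Now call C; its callback triggers the event.
C( [event]{ cudaEventUnRecord(event); } ); // cudaEventUnRecord is the proposed API; it does not exist today.
// Now ask the stream to wait for the event before executing K.
cudaStreamWaitEvent(stream, event); // the stream is the first argument
K();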
This way, the kernel launch latency is hidden, because the launch of K overlaps with the async CPU function C.
Would it be possible to expose an interface like this?