Expose more cudaEvent functionality.

This would help hide kernel launch latency.
For example, I want to launch a kernel K after an asynchronous CPU function C has finished.
Usually this is done by passing the kernel call as a callback to the async function C:

void C(std::function<void()> callback);
C( []{K();} );
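For reference, the callback pattern above can be sketched in plain host-side C++. This is only an illustration under assumptions: K here is a stand-in that records a message instead of launching a kernel, C is modelled with std::async, and C returns a std::future (not part of the signature above) so the caller can wait for it. The trace parameter and run_callback_pattern are hypothetical helpers added for the demo.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Stand-in for the kernel launch (assumption); a real K would be launched
// as K<<<grid, block, 0, stream>>>().
void K(std::vector<std::string>& trace) { trace.push_back("K launched"); }

// Hypothetical async C: does its work on another thread, then invokes the callback.
// It returns a future (an addition for this sketch) so the caller can join it.
std::future<void> C(std::vector<std::string>& trace, std::function<void()> callback) {
    return std::async(std::launch::async, [&trace, cb = std::move(callback)] {
        trace.push_back("C finished");  // C's own work ends here
        cb();                           // only now is K launched
    });
}

// Runs the pattern once and returns the order in which things happened.
std::vector<std::string> run_callback_pattern() {
    std::vector<std::string> trace;
    std::future<void> done = C(trace, [&trace] { K(trace); });
    done.get();  // wait for C and its callback
    assert(trace.size() == 2);
    return trace;
}
```

Note that the launch of K cannot start until C has finished, so the launch latency is paid in full after C.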

However, I would like to do it like this:

cudaEvent_t event;
cudaEventCreate(&event);
cudaEventRecord(event); // proposed: record from the CPU, without a stream
// Now call C; its callback triggers the event.
C( [event]{ cudaEventUnRecord(event); } ); // cudaEventUnRecord is the proposed call, not an existing API
// Now ask the stream to wait for the event before executing K.
cudaStreamWaitEvent(stream, event, 0);
K();

In this way, we can hide the kernel launch latency by overlapping the launch with the async CPU function C.
Is it possible to expose this interface?
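To make the requested ordering concrete, here is a host-only C++ model of it. It makes no CUDA calls and is not a CUDA implementation: the "stream" is just a worker thread, the host-triggered event is modelled with a std::promise, and the c_work promise is a demo device that lets the sketch control when C's work ends. All names here (run_gated_launch, gate, c_work) are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <thread>
#include <vector>

// Returns the order of events when K is enqueued behind a host-triggered gate.
// The promise/future handoffs order every write to trace, so no mutex is needed.
std::vector<std::string> run_gated_launch() {
    std::vector<std::string> trace;

    std::promise<void> gate;                 // models the event that the CPU triggers
    std::future<void> gate_fired = gate.get_future();
    std::promise<void> c_work;               // lets the demo decide when C's work ends
    std::future<void> c_work_done = c_work.get_future();

    // Async C: when its work is done, the callback fires the gate.
    std::function<void()> callback = [&gate] { gate.set_value(); };
    std::thread c_thread([&] {
        c_work_done.wait();
        trace.push_back("C finished");
        callback();
    });

    // While C is still busy, enqueue "wait for the gate, then K" on the stream.
    trace.push_back("K enqueued");
    std::thread stream([&] {
        gate_fired.wait();                   // stream stalls until the callback fires
        trace.push_back("K runs");
    });

    c_work.set_value();                      // C's work completes after K was enqueued
    c_thread.join();
    stream.join();
    assert(trace.size() == 3);
    return trace;
}
```

In this model K is already queued while C is still working, which is the overlap being asked for; whether a real driver could offer the same behaviour is exactly the open question.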

You cannot eliminate launch overhead, only hide it.
You mention hiding it, but are you not actually trying to eliminate it?

You can rush the conditions/inputs necessary to launch the kernel.
You can launch the kernel as quickly as possible.
If the kernel launch is not at the end of the function, why not simply use two functions?
Kernel launches are checked (by the driver).
Memory copies may need to be issued.
Hence, exactly what are you going to overlap with what, and how much is there really to squeeze out?