cuStreamWaitEvent using cuStreamWaitEvent with memcopies and kernel launches

Can anyone provide any clues on how to use cuStreamWaitEvent to queue-up async operations?

I can’t seem to hold-back a kernel launch with it, either before or after my call to cuLaunchKernel, or in combination with cuEventRecord - given a CUstream, cuLaunchKernel launches the stream right away.

The docs are confusing:


“The stream hStream will wait only for the completion of the most recent host call to cuEventRecord() on hEvent.”


“If cuEventRecord() has not been called on hEvent, this call acts as if the record has already completed, and so is a functional no-op.”

So, it will wait until the most recent event but if there hasn’t been one it does nothing.

Intertubes, where art thou?

cuLaunchKernel is asynchronous, so it will enqueue the launch and return immediately.

The docs that you mentioned basically mean that if you do this:

cuEventRecord(event, …);
cuStreamWaitEvent(stream1, event);
cuLaunchKernel(…, stream1, …);
cuEventRecord(event, …);

the kernel launch will wait for only the first instance of cuEventRecord, not both. Events are marked as already triggered if they are never recorded as a way to prevent deadlock.

Basically, I do:

cuEventCreate( event );
cuStreamCreate( stream1 );
cuStreamWaitEvent(stream1, event);
cuLaunchKernel(…, stream1, …);

And discover that my kernel ran, which I didn’t want. How do I give cuStreamWaitEvent(…) an untriggered event?

Ok, I got this figured out now. See concurrentKernels, except notice that I am using the driver api…

My usecase is to queue a memcopy into a stream between kernel executions, so I’ve ended up doing it like:

cuMemcpyHtoDAsync( ... stream0 );                     // push data

cuLaunchKernel( ... kernelX ... stream0 );            // queue kernelX

cuEventRecord( stream0vent, stream0 );                // queue event "end of kernelX"

cuStreamWaitEvent( stream1, stream0event );           // queue stream-join

cuMemcpyDtoHAsync( ... stream1 );                     // pull result1 (result without kernelY effects)

cuLaunchKernel( ... kernelY ... stream1 );            // queue kernelY

cuMemcpyDtoHAsync( ... stream1 );                     // pull result2 (result with kernelY effects)

cuStreamSynchronize( stream1 );                       // sync cpu

StreamWaitEvent appears to queue a ‘wait for a specific event’ into a specific stream, and in this way allows me to create an execution-list, with kernel execution or memcopy dependencies (possibly with linear or tree flavours. loops?).

I’m using flags CU_EVENT_DISABLE_TIMING, CU_CTX_SCHED_BLOCKING_SYNC and CU_CTX_MAP_HOST. Are there others I should be aware of?

Anyone out there know if this approach will blow-up in my face sooner (I’m GTX285, compute 1.3)? (likely later of course, that goes without question…)

The problem in your second post is that you haven’t recorded the event before waiting on it. This causes cuStreamWaitEvent to be completed immediately, so it won’t block anything.

Why this is a good thing: Launching to the GPU is generally asynchronous, but it’s not guaranteed to be asynchronous. Eventually, you will fill up some queue somewhere and the driver will have to wait on the CPU for something to drain a bit before launching more work. If we allowed you to call cuStreamWaitEvent on an event before you recorded it, you could do something like

cuStreamWaitEvent(stream, event);

for (int i = 0; i < 1000000; i++) {

   kernel<<<..., stream>>>(i); // eventually you won't be able to launch any more kernels, but no kernels can run


kernel2<<<..., stream2>>>(0); // you'll never reach here because you're stuck in the loop

cudaEventRecord(event, stream2); // deadlock!