cuStreamWaitEvent: using cuStreamWaitEvent with memcopies and kernel launches

orthopteroid · November 16, 2011, 12:27am

Can anyone provide any clues on how to use cuStreamWaitEvent to queue-up async operations?

I can’t seem to hold back a kernel launch with it, whether I call it before or after cuLaunchKernel, or in combination with cuEventRecord. Given a CUstream, cuLaunchKernel launches the kernel right away.

The docs are confusing:

cuStreamWaitEvent:

“The stream hStream will wait only for the completion of the most recent host call to cuEventRecord() on hEvent.”

VS

“If cuEventRecord() has not been called on hEvent, this call acts as if the record has already completed, and so is a functional no-op.”

So, it will wait for the most recent record of the event, but if there hasn’t been one it does nothing.

Intertubes, where art thou?

tmurray · November 16, 2011, 12:42am

cuLaunchKernel is asynchronous, so it will enqueue the launch and return immediately.

The docs that you mentioned basically mean that if you do this:

cuEventRecord(event, …);               // first record
cuStreamWaitEvent(stream1, event, 0);  // waits on the first record only
cuLaunchKernel(…, stream1, …);
cuEventRecord(event, …);               // second record, not waited on

the kernel launch will wait for only the first instance of cuEventRecord, not both. Events are marked as already triggered if they are never recorded as a way to prevent deadlock.
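The two rules above (a wait snapshots the most recent record, and a never-recorded event counts as already triggered) can be illustrated with a small host-side toy model. This is only a sketch of the documented behaviour in plain C, not the CUDA driver itself. `ToyEvent` and all the `toy_*` functions are hypothetical names invented for this illustration.

```c
#include <stdbool.h>

/* Toy host-side model of the documented semantics. NOT the CUDA driver:
 * every name below is hypothetical and exists only for illustration. */
typedef struct {
    unsigned recorded;   /* record calls issued so far (0 = never recorded) */
    unsigned completed;  /* highest record number the device has reached    */
} ToyEvent;

/* Models cuEventRecord: issues a new record and returns its number. */
static unsigned toy_event_record(ToyEvent *ev) {
    return ++ev->recorded;
}

/* Models the device reaching the point in a stream where a record sits. */
static void toy_event_complete(ToyEvent *ev, unsigned record_no) {
    if (record_no > ev->completed) {
        ev->completed = record_no;
    }
}

/* Models cuStreamWaitEvent: snapshots the most recent record at call time.
 * Records issued after this call do not change the snapshot. */
static unsigned toy_stream_wait_event(const ToyEvent *ev) {
    return ev->recorded;
}

/* True once the waiting stream may proceed. A snapshot of 0 means the event
 * had never been recorded, so the wait is a functional no-op. */
static bool toy_wait_satisfied(const ToyEvent *ev, unsigned snapshot) {
    return snapshot == 0 || ev->completed >= snapshot;
}
```

In this model, a wait taken after the first record and before the second is satisfied as soon as the first record completes, whether or not the second ever does. A wait taken on a fresh event is satisfied immediately.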

orthopteroid · November 16, 2011, 2:34am

Basically, I do:

cuEventCreate( &event, … );
cuStreamCreate( &stream1, … );
cuStreamWaitEvent( stream1, event, 0 );
cuLaunchKernel(…, stream1, …);
cuMemcpyDtoH(…);

And discover that my kernel ran, which I didn’t want. How do I give cuStreamWaitEvent(…) an untriggered event?

orthopteroid · November 16, 2011, 11:02pm

OK, I have this figured out now. See the concurrentKernels SDK sample, but note that I am using the driver API.

My use case is to queue a memcopy into a stream between kernel executions, so I’ve ended up doing it like this:

cuMemcpyHtoDAsync( ... stream0 );                     // push data
cuLaunchKernel( ... kernelX ... stream0 );            // queue kernelX
cuEventRecord( stream0event, stream0 );               // queue event "end of kernelX"
cuStreamWaitEvent( stream1, stream0event, 0 );        // queue stream-join
cuMemcpyDtoHAsync( ... stream1 );                     // pull result1 (result without kernelY effects)
cuLaunchKernel( ... kernelY ... stream1 );            // queue kernelY
cuMemcpyDtoHAsync( ... stream1 );                     // pull result2 (result with kernelY effects)
cuStreamSynchronize( stream1 );                       // sync cpu

cuStreamWaitEvent appears to queue a ‘wait for a specific event’ into a specific stream. That lets me build an execution list with kernel-execution or memcopy dependencies (possibly linear or tree-shaped; loops?).

I’m using flags CU_EVENT_DISABLE_TIMING, CU_CTX_SCHED_BLOCKING_SYNC and CU_CTX_MAP_HOST. Are there others I should be aware of?

Does anyone out there know if this approach will blow up in my face sooner (I’m on a GTX 285, compute 1.3)? (Likely later of course, that goes without question…)

tmurray · November 19, 2011, 12:04am

The problem in your second post is that you haven’t recorded the event before waiting on it. This causes cuStreamWaitEvent to complete immediately, so it won’t block anything.

Why this is a good thing: Launching to the GPU is generally asynchronous, but it’s not guaranteed to be asynchronous. Eventually, you will fill up some queue somewhere and the driver will have to block the CPU thread until something drains a bit before it can launch more work. If we allowed you to call cuStreamWaitEvent on an event before you recorded it, you could do something like this:

cuStreamWaitEvent(stream, event, 0);
for (int i = 0; i < 1000000; i++) {
   kernel<<<..., stream>>>(i); // eventually you won't be able to launch any more kernels, but no kernels can run
}
kernel2<<<..., stream2>>>(0); // you'll never reach here because you're stuck in the loop
cudaEventRecord(event, stream2); // deadlock!
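
The deadlock argument above can be sketched with a toy host-side model of a bounded launch queue. This is plain C, not CUDA. `ToyStream`, the `toy_*` functions, and the capacity of 4 are all hypothetical stand-ins chosen for illustration; the real driver's queue limits are not specified in this thread.

```c
#include <stdbool.h>

/* Hypothetical capacity. Real driver limits are not stated in this thread. */
#define TOY_QUEUE_CAPACITY 4

/* Toy model of one stream. NOT the CUDA driver. */
typedef struct {
    int  pending;       /* launches queued in the stream but not yet run   */
    bool head_blocked;  /* true if the stream sits behind an unmet wait    */
} ToyStream;

/* Device side: run everything queued, unless the stream is blocked. */
static void toy_drain(ToyStream *s) {
    if (!s->head_blocked) {
        s->pending = 0;
    }
}

/* Host side: try to queue one launch. Returns false at the point where a
 * real host thread would stall waiting for the queue to drain. */
static bool toy_launch(ToyStream *s) {
    toy_drain(s);
    if (s->pending == TOY_QUEUE_CAPACITY) {
        return false;
    }
    s->pending++;
    return true;
}

/* Queue up to n launches. Returns how many were accepted before the host
 * would have stalled. */
static int toy_launch_many(ToyStream *s, int n) {
    int accepted = 0;
    while (accepted < n && toy_launch(s)) {
        accepted++;
    }
    return accepted;
}
```

With `head_blocked` set to true (the hypothetical case where a wait on an unrecorded event really blocks), `toy_launch_many` stops at the queue capacity, so the host would never get as far as recording the event. With `head_blocked` set to false (the actual behaviour, where the wait is a no-op), every launch is accepted.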
