cudaEvents

krazanmp · October 17, 2014, 7:01pm

I have a system I am porting from OpenCL and am trying to maintain the stream libe behaviour of it. I have some CPU code that is threaded and sends off work to the GPU asynchronously. In OpenCL I am able to create events and use these events later to determine if the GPU work is complete or not.

It seems in CUDA, this is not possible without additional CPU-side code to handle potential race conditions. For example the following code:

cudaEventCreate( CUDAEvent );
// ← CPU thread switch here can cause
cudaMemcpy(…); // ← the CPU to think this event
// ← has already been completed
cudaEventRecord( CUDAEvent , CUDAStream );

will inherently create a race condition BECAUSE the API call that queries its status can’t disambiguate between pre-record state and post-completion state. Am I missing something? Is there a way to create an event that is completely invalid until it has been completed within the stream?

PS:
FWIW: in OpenCL, this works just fine since it has 4 states: QUEUED, SUBMITTED, RUNNING, COMPLETE and a test for COMPLETE never happens until it has gone through all of the previous states.

allanmac · October 17, 2014, 9:08pm

Hmm… I didn’t realize that CUDA events start out “complete” but now I see it in the docs:

One solution might be to create a thread-safe pool of events that are created and not in use as well as a thread-safe stack/queue that contains events that have been recorded but not yet successfully synchronized against or queried.

It’s a small amount of code and has worked well for me in the past but it’s unfortunate that it is needed at all.

IMHO, CUDA should borrow from OpenCL’s event and command queue APIs. I could go on about this for hours. Not being able to easily express a “happens before” relationship between concurrent kernels is an API deficiency.

Rolling your own dependency graph enforcing scheduler is the solution, but most CUDA devs would probably prefer to work on their kernels. My instinct is that Dynamic Parallelism would also benefit from a more flexible scheduling system.

I’m guessing the design of CUDA streams was very much driven by the hardware that was available way way back when CUDA was first developed. Totally reasonable but maybe it’s time for a brush-up.

If an OpenCL’ish feature set is unacceptable then I would at least like to be able to create streams where kernels can execute in any order and not just FIFO. Also, I want to be able to express happens-before – either via the kernel execution configuration or preceding+following proxy events.

You can roll some of this yourself but it’s probably not going to be easy for some devs. Kernel callbacks are another solution (vs. querying or blocking) but the 30-40 usec delay I’m observing seems unnecessary.

Perhaps there are CUDA idioms for this that I’ve missed. In which case, I’d like to learn what they are! :)

Topic		Replies	Views
cudaEventQuery vs cudaStreamQuery CUDA Programming and Performance	0	1517	December 30, 2012
some cuda question CUDA Programming and Performance	6	980	December 23, 2015
Cuda syncronize APIs CUDA Programming and Performance	9	28	January 28, 2025
cudaEventRecord functionality CUDA Programming and Performance	1	11468	March 10, 2009
limit on number of events linked to a stream via cudaStreamWaitEvent()..? CUDA Programming and Performance	6	2333	May 8, 2015
Fail to sync the cudaMemcpyAsync using the cudaEvent in two streams CUDA Programming and Performance	4	243	April 1, 2024
cuEventElapsedTime slow/blocking? CUDA Programming and Performance	2	418	August 6, 2014
Adding CUDA streams to threaded software CUDA Programming and Performance	7	8579	August 2, 2011
4.0 RC - many host threads per one GPU - cudaStreamQuery and cudaStreamSynchronize behaviour. CUDA Programming and Performance	5	13389	March 10, 2011
Streams and CPU CUDA Programming and Performance	1	1029	September 27, 2013

cudaEvents

Related topics