I have a system I am porting from OpenCL and am trying to maintain its stream-like behaviour. I have some CPU code that is threaded and sends off work to the GPU asynchronously. In OpenCL I am able to create events and use these events later to determine whether the GPU work is complete or not.

It seems that in CUDA this is not possible without additional CPU-side code to handle potential race conditions. For example, the following code:

cudaEventCreate( &CUDAEvent );
//                     ← a CPU thread switch here can cause
cudaMemcpyAsync( … );  // ← another thread to query this event
//                     ← and conclude it has already completed
cudaEventRecord( CUDAEvent, CUDAStream );

will inherently create a race condition, because cudaEventQuery() returns cudaSuccess both for an event that has never been recorded and for an event whose work has completed — it can't disambiguate the pre-record state from the post-completion state. Am I missing something? Is there a way to create an event that is completely invalid until it has been completed within the stream?

FWIW: in OpenCL this works just fine, since an event has 4 states — QUEUED, SUBMITTED, RUNNING, COMPLETE — and a test for COMPLETE never succeeds until the event has passed through all of the previous states.
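To make the ambiguity concrete, here is a caricature of the two observable state models. These are not real API types — the enums and the query() helper are mine — but the binary return models what cudaEventQuery reports:

```cpp
#include <cassert>

// OpenCL exposes four distinct event states.
enum class CLState { Queued, Submitted, Running, Complete };

// A CUDA event's observable status is effectively binary:
// cudaSuccess ("done") or cudaErrorNotReady. The catch is that a
// freshly created, never-recorded event also reports cudaSuccess.
enum class CudaQuery { Success, NotReady };

CudaQuery query(bool recorded, bool finished) {
    // Never recorded → reports Success, indistinguishable from real completion.
    if (!recorded) return CudaQuery::Success;
    return finished ? CudaQuery::Success : CudaQuery::NotReady;
}
```

So a thread that polls the event between cudaEventCreate and cudaEventRecord sees "Success" and wrongly retires the work item.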

Hmm… I didn’t realize that CUDA events start out “complete”, but now I see it in the docs.

One solution might be to maintain a thread-safe pool of events that have been created but are not in use, along with a thread-safe stack/queue of events that have been recorded but not yet successfully synchronized against or queried.

It’s a small amount of code and has worked well for me in the past but it’s unfortunate that it is needed at all.
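A minimal sketch of that pool, with `int` standing in for `cudaEvent_t` so it compiles without the CUDA toolkit. In the real version, acquire() would call cudaEventCreate on a cache miss, the caller would cudaEventRecord before mark_recorded(), and try_retire() would only be called after cudaEventQuery returned cudaSuccess:

```cpp
#include <cassert>
#include <mutex>
#include <queue>
#include <vector>

class EventPool {
public:
    int acquire() {                 // hand out an event that is NOT in flight
        std::lock_guard<std::mutex> g(m_);
        if (free_.empty()) free_.push_back(next_++);  // stand-in for cudaEventCreate
        int e = free_.back();
        free_.pop_back();
        return e;
    }
    void mark_recorded(int e) {     // caller has just called cudaEventRecord
        std::lock_guard<std::mutex> g(m_);
        inflight_.push(e);
    }
    bool try_retire(int* e) {       // caller has confirmed completion via query/sync
        std::lock_guard<std::mutex> g(m_);
        if (inflight_.empty()) return false;
        *e = inflight_.front();
        inflight_.pop();
        free_.push_back(*e);        // recycle the handle for reuse
        return true;
    }
private:
    std::mutex m_;
    std::vector<int> free_;         // created, not yet recorded
    std::queue<int> inflight_;      // recorded, not yet confirmed complete
    int next_ = 0;
};
```

The key invariant is that an event is only ever queried while it sits in the in-flight queue, i.e. strictly after its record call, which sidesteps the pre-record "looks complete" state entirely.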

IMHO, CUDA should borrow from OpenCL’s event and command queue APIs. I could go on about this for hours. Not being able to easily express a “happens before” relationship between concurrent kernels is an API deficiency.

Rolling your own dependency-graph-enforcing scheduler is a solution, but most CUDA devs would probably prefer to spend that time on their kernels. My instinct is that Dynamic Parallelism would also benefit from a more flexible scheduling system.

I’m guessing the design of CUDA streams was very much driven by the hardware that was available way way back when CUDA was first developed. Totally reasonable but maybe it’s time for a brush-up.

If an OpenCL’ish feature set is unacceptable, then I would at least like to be able to create streams where kernels can execute in any order, not just FIFO. I would also like to be able to express happens-before relationships — either via the kernel execution configuration or via preceding+following proxy events.

You can roll some of this yourself, but it’s probably not going to be easy for some devs. Stream callbacks are another solution (vs. querying or blocking), but the 30–40 usec delay I’m observing seems unnecessary.

Perhaps there are CUDA idioms for this that I’ve missed. In which case, I’d like to learn what they are! :)