How to block a single host thread on a CUDA event in a multi-threaded CUDA application

Suppose we have a multi-threaded application using the GPU.

Is there a way to block only a single host thread until a specific CUDA/OpenCL event is recorded?

Here’s an example:

CPU Thread T2:

- wait until event EX is recorded in stream SY

- do some T2 work

(CPU Threads T0, T1, T3, ... running)

cudaEventSynchronize on an event created with the cudaEventBlockingSync flag works best: it blocks only the calling host thread, and that thread sleeps rather than spin-waits until the event completes
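In code, that answer can be sketched as follows (a minimal sketch; `someKernel`, `grid`, `block`, and `streamSY` are placeholders, and error checking is omitted):

```cpp
#include <cuda_runtime.h>

// Create the event with cudaEventBlockingSync so that
// cudaEventSynchronize yields the calling thread instead of busy-waiting.
cudaEvent_t eventX;
cudaEventCreateWithFlags(&eventX, cudaEventBlockingSync);

// Some thread: issue GPU work in stream SY, then record the event there.
someKernel<<<grid, block, 0, streamSY>>>(/* args */);
cudaEventRecord(eventX, streamSY);

// T2: blocks only this host thread until eventX has completed.
cudaEventSynchronize(eventX);
// ...do T2 work...
```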

Thanks! How can I make it block on multiple events (the order in which the events complete is unknown)?

e.g. T2 waits for these three events before proceeding: event EX in stream SX, event EY in SY, and event EZ in SZ.

there’s no API call that waits on multiple events at once, but you can simply call cudaEventSynchronize on each event in turn; the order of the calls doesn’t matter, since each one returns as soon as its event has completed. however, keep in mind that to avoid deadlock we require that the event be recorded before you try to query or synchronize on it.
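The sequential-wait approach from that reply looks like this (a sketch; `eventX`, `eventY`, `eventZ` are assumed to have been created and recorded as in the question):

```cpp
// T3 waits for all three events. The order of the calls doesn't matter:
// each call blocks until its event completes, or returns immediately if
// the event has already completed by the time the call is reached.
cudaEventSynchronize(eventX);  // recorded in stream SX
cudaEventSynchronize(eventY);  // recorded in stream SY
cudaEventSynchronize(eventZ);  // recorded in stream SZ
// From here on, all three pieces of GPU work are known to be finished.
```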

Hm, how do I ensure this condition in a multi-threaded application? The event will be recorded from one thread and waited on from another. What I actually need is a synchronization mechanism between host threads on the completion of some GPU work, such as a kernel execution or an async memcpy. My idea was to have something like this:

T1:

 ...some CPU work

 issue some GPU work in stream SX

 issue CUDA eventX in SX

 ...

T2:

 issue some GPU work in stream SY

 issue CUDA eventY in SY

T3:

 cudaEventSynchronize(eventX) 

 cudaEventSynchronize(eventY) 

 counterEventX++

 counterEventY++

 ...some CPU or GPU work

How to go about it?

p.s. Is there a possibility to capture every occurrence of an event EX? Or is there something like a fast atomic increment of a host variable from a CUDA stream? I’m looking for a way to implement a semaphore that signals when some (asynchronously issued) GPU work has completed.
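One mechanism that fits the p.s. is a host callback enqueued into the stream: cudaLaunchHostFunc (available since CUDA 10) runs a host function once all work queued in the stream before it has completed, and that function can increment an atomic counter or post a semaphore, once per enqueued batch of work. A sketch, assuming a stream `streamSX` already exists (error checking omitted):

```cpp
#include <atomic>
#include <cuda_runtime.h>

std::atomic<int> workDone{0};  // host-side semaphore/counter

// Runs on a CUDA-internal host thread once all work queued in the stream
// before it has finished. Must not itself call CUDA API functions.
void CUDART_CB onWorkDone(void* userData) {
    static_cast<std::atomic<int>*>(userData)->fetch_add(1);
}

// After issuing the async GPU work (kernel launch, cudaMemcpyAsync, ...):
cudaLaunchHostFunc(streamSX, onWorkDone, &workDone);
// Any host thread can now wait on or poll workDone; each completed batch
// of GPU work in streamSX bumps the counter by one.
```

This also sidesteps the "every occurrence" problem with events: an event only reflects its most recent cudaEventRecord, whereas one callback is enqueued, and fires, per batch of work.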