Let’s say I want to launch task_a and task_b simultaneously (in different streams). As soon as one of them finishes, I want to launch task_c.
I suppose I could have the host thread wake up every once in a while, and call cudaEventQuery to see if any of the events occurred. However, I’m hoping for a more elegant solution. Is there one?
Your "any" qualifier makes it more difficult.
Your hypothetical options are: events, callbacks, and multiple host threads.
You can synchronize on stream events you recorded (cudaEventSynchronize()), but I do not see how you would manage the "any" qualifier across multiple events with a single host thread.
If you know that one of the two tasks (a or b) always executes longer than the other, you could simply record a single event after the shorter task, and also launch the shorter task first, in an attempt to guarantee that it finishes first.
Callbacks can tell you which stream finishes first, but you are not allowed to issue CUDA API calls from within callback functions, directly or indirectly; hence you would have to implement a user-defined host event to really make it work: the host waits on the host event after issuing a and b, and the callback functions trigger it.
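A minimal sketch of that callback + host-event idea, using C++11 synchronization primitives for the "host event" and cudaStreamAddCallback; the kernel names and trivial launch configurations are placeholders for your real tasks:

```cuda
// Sketch of the callback + user-defined host event approach (assumption:
// task_a_kernel / task_b_kernel / task_c_kernel stand in for the real work).
#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>

__global__ void task_a_kernel() { /* ... */ }
__global__ void task_b_kernel() { /* ... */ }
__global__ void task_c_kernel() { /* ... */ }

std::mutex m;
std::condition_variable cv;
bool first_finished = false;  // the "host event"

// Runs on a CUDA-internal thread once the stream reaches this point.
// Note: no CUDA API calls allowed in here, directly or indirectly.
void CUDART_CB on_task_done(cudaStream_t, cudaError_t, void *) {
    { std::lock_guard<std::mutex> lk(m); first_finished = true; }
    cv.notify_one();
}

int main() {
    cudaStream_t sA, sB;
    cudaStreamCreate(&sA);
    cudaStreamCreate(&sB);

    task_a_kernel<<<1, 1, 0, sA>>>();
    cudaStreamAddCallback(sA, on_task_done, nullptr, 0);
    task_b_kernel<<<1, 1, 0, sB>>>();
    cudaStreamAddCallback(sB, on_task_done, nullptr, 0);

    // Host blocks here until whichever callback fires first wakes it.
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return first_finished; });
    lk.unlock();

    // task_c is issued from the host thread, never from the callback.
    task_c_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

(On newer toolkits, cudaLaunchHostFunc is the preferred replacement for cudaStreamAddCallback; the same restriction on issuing CUDA calls applies.)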
Lastly, you could have one host thread each for tasks a and b, to issue the task and wait on it; the host threads share a volatile variable, which they then test to determine who must launch c.
One possible approach. I don’t know if it is “more elegant”.
Modify your definition of task_c as follows:
task_c will use a (e.g. pthread) mutex to control access to a global (host) variable. At the start of task_c, it will check the global variable. If it is not set, it will set it and proceed (with the rest of task_c). If it is set, it will exit (and skip the rest of task_c).
Now, in host thread 1, launch task_a into stream A, followed by cudaStreamSynchronize, followed by task_c. In host thread 2, launch task_b into stream B, followed by cudaStreamSynchronize, followed by task_c. Only one will execute task_c, whichever one gets there first. This doesn’t require any explicit spin-polling, nor the use of any events.
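A sketch of this approach, using std::thread and std::mutex in place of pthreads; the kernel names and launcher functions are placeholders:

```cuda
// Sketch of the mutex-guarded task_c approach: both host threads call
// task_c(), but only the first arrival does the work.
#include <cuda_runtime.h>
#include <mutex>
#include <thread>

__global__ void task_a_kernel() { /* ... */ }
__global__ void task_b_kernel() { /* ... */ }
__global__ void task_c_kernel() { /* ... */ }

std::mutex c_mutex;
bool c_started = false;  // the global host variable

void task_c() {
    {
        std::lock_guard<std::mutex> lk(c_mutex);
        if (c_started) return;  // already set: skip the rest of task_c
        c_started = true;       // not set: claim it and proceed
    }
    task_c_kernel<<<1, 1>>>();  // the rest of task_c
    cudaDeviceSynchronize();
}

void launch_a(cudaStream_t s) { task_a_kernel<<<1, 1, 0, s>>>(); }
void launch_b(cudaStream_t s) { task_b_kernel<<<1, 1, 0, s>>>(); }

void worker(void (*launch_task)(cudaStream_t), cudaStream_t s) {
    launch_task(s);            // issue task_a or task_b into this stream
    cudaStreamSynchronize(s);  // block until it finishes
    task_c();                  // whoever gets here first executes task_c
}

int main() {
    cudaStream_t sA, sB;
    cudaStreamCreate(&sA);
    cudaStreamCreate(&sB);
    std::thread t1(worker, launch_a, sA);  // host thread 1
    std::thread t2(worker, launch_b, sB);  // host thread 2
    t1.join();
    t2.join();
    return 0;
}
```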
And if task_c is just a kernel call (as opposed to a more complicated sequence of CUDA calls), it can be done with dynamic parallelism using a method similar to the above. Just create a global device variable. Whichever task_c kernel begins first updates the device variable, then does a child launch on the desired task_c kernel.
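The dynamic-parallelism variant might look like the following sketch (assumes compute capability 3.5+ and compilation with -rdc=true; kernel names are placeholders). Using atomicCAS on the device variable makes the "whoever begins first" race well-defined:

```cuda
// Sketch of the dynamic-parallelism gate: enqueue task_c_gate at the tail
// of BOTH stream A and stream B, after task_a and task_b respectively.
#include <cuda_runtime.h>

__device__ int c_launched = 0;  // global device variable

__global__ void task_c_kernel() { /* the actual task_c work */ }

// Single-thread gate kernel: the first copy to run wins the atomicCAS
// and performs the child launch; the other copy simply exits.
__global__ void task_c_gate() {
    if (atomicCAS(&c_launched, 0, 1) == 0) {
        task_c_kernel<<<1, 1>>>();  // child launch via dynamic parallelism
    }
}
```

Usage: after launching task_a into stream A, launch task_c_gate<<<1, 1, 0, sA>>>(); likewise task_c_gate<<<1, 1, 0, sB>>>() after task_b in stream B.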
Both methods are extensible to waiting on any one of more than two previous tasks to finish.
Thanks for the suggestions!