Scheduling kernel dependencies

I have three kernels, A, B, and C. A and B are independent and can be run in either order, or ideally concurrently.

Kernel C depends on the results of both A and B so it should be run after both kernels are finished.

I can schedule A and B on different streams and let the GPU automatically schedule the kernels, hopefully concurrently assuming the GPU has the resources for it.

But how do I efficiently schedule kernel C to begin after both A and B? The obvious method is to launch A and B on two streams, have the CPU synchronize on stream A, then synchronize on stream B, and finally launch kernel C from the CPU. This works.
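For clarity, the CPU-synchronized version I'm describing looks roughly like this (kernel names and launch configurations are placeholders):

```cpp
kernelA<<<gridA, blockA, 0, streamA>>>(/* ... */);
kernelB<<<gridB, blockB, 0, streamB>>>(/* ... */);

cudaStreamSynchronize(streamA);   // CPU blocks here...
cudaStreamSynchronize(streamB);   // ...and again here

// C is only launched once the CPU thread wakes up from both waits
kernelC<<<gridC, blockC, 0, streamA>>>(/* ... */);
```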

But these are short kernels and my Jetson CPU is not fast, so there's noticeable CPU overhead and latency in this spin state, enough to really hurt throughput: the GPU sits idle half the time while it waits for the (overloaded) CPU to juggle the scheduling. The CPU latency is exacerbated by the fact that the CPU is doing its own compute in parallel.

Is there a way to have the GPU or GPU driver launch kernel C for me after both A and B are done?

It seems like a terrible hack, but I'm getting better performance by having A and B write a global "I'm done" flag to device memory when their last block finishes, and if both are finished, launching kernel C using dynamic parallelism from inside kernel A or B. This is a terrible, ugly hack: sensitive to race conditions, sprinkled with memory fences, and confusing to code and maintain, but it does seem to avoid some of the CPU latency. But there's still a gap between kernel launches and I'm losing performance. So there must be SOME solution that is at least that efficient but isn't a hack.
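For the record, the hack looks roughly like this (a simplified sketch, not my exact code: 1-D grid assumed, kernel names and launch configurations are placeholders, and it needs -rdc=true plus a device that supports dynamic parallelism):

```cpp
__device__ unsigned int blocksDoneA = 0;   // one retirement counter per kernel
__device__ unsigned int kernelsDone = 0;   // shared "A and B both finished" counter

__global__ void kernelC(/* ... */);

// Kernel B is the mirror image of this, with its own blocksDoneB counter.
__global__ void kernelA(/* ... */)
{
    // ... real work ...
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence();  // publish this block's results device-wide
        // The last block of A to retire bumps the shared kernel counter.
        if (atomicAdd(&blocksDoneA, 1) == gridDim.x - 1) {
            // Whichever of A/B finishes second launches C from the device.
            if (atomicAdd(&kernelsDone, 1) == 1)
                kernelC<<<gridC, blockC>>>(/* ... */);
        }
    }
}
```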

The other idea is to completely recode the three kernels into one monolithic kernel and do even more hacks for it to self-synchronize each dependent step of the compute. This seems like an even worse solution, though it might work.

Thanks for any ideas, guys!

I usually encourage people with Jetson questions to file their questions on one of the Jetson forums, but you’re welcome to do whatever you wish. I’m not saying your question is off-topic here, only that there is a significant Jetson community out there and you may get better answers there.

Anyway, one possible approach is to launch kernels A and B into separate streams. After each kernel launch, record an event into that stream. In a 3rd stream, put two cudaStreamWaitEvent() calls, one for each of the two previously mentioned events. After those two calls in the 3rd stream, launch kernel C.

This should all be fully asynchronous (i.e. not blocking the CPU thread in any way) and the launching of kernel C should be handled completely by GPU scheduler (i.e. not by CPU code).
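A rough sketch of that sequence (error checking omitted; kernel names and launch configurations are placeholders):

```cpp
cudaStream_t streamA, streamB, streamC;
cudaEvent_t doneA, doneB;

cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);
cudaStreamCreate(&streamC);
// cudaEventDisableTiming makes events cheaper when they are used
// only for synchronization, not for timing.
cudaEventCreateWithFlags(&doneA, cudaEventDisableTiming);
cudaEventCreateWithFlags(&doneB, cudaEventDisableTiming);

kernelA<<<gridA, blockA, 0, streamA>>>(/* ... */);
cudaEventRecord(doneA, streamA);

kernelB<<<gridB, blockB, 0, streamB>>>(/* ... */);
cudaEventRecord(doneB, streamB);

// Stream C waits (on the device, not the CPU) until both events have fired.
cudaStreamWaitEvent(streamC, doneA, 0);
cudaStreamWaitEvent(streamC, doneB, 0);
kernelC<<<gridC, blockC, 0, streamC>>>(/* ... */);
// The CPU thread never blocks; the ordering is enforced by the GPU scheduler.
```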

Whether or not this is more performant than your current approaches, I cannot say. Whether or not it completely removes gaps that you are seeing in the profiler, I cannot say.

txbob, thanks for the reply.

My issue isn’t really Jetson specific since I also run on desktop. But your answer gave the likely solution!
I was using cudaStreamSynchronize() twice, on A and B, and somehow skipped over cudaStreamWaitEvent(), thinking it was equivalent in my use case. But on re-reading the docs, I see there's a significant difference: cudaStreamWaitEvent() works at the driver scheduler level, which is exactly what I was looking for! I bet this will work.

I should have asked for help before spending days on my embarrassing workaround.

Is there any reason you suggested using a third stream for kernel C? Why not just put it into one of the other streams with a single event? I.e.: stream 0 launches kernel A and records event A-is-done; stream 1 launches kernel B, waits for event A-is-done, then launches kernel C.

Maybe there’s no difference, but I’m curious if there’s some other issue that makes a third stream preferable.
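Concretely, the two-stream version I have in mind (kernel names and launch configurations are placeholders):

```cpp
kernelA<<<gridA, blockA, 0, stream0>>>(/* ... */);
cudaEventRecord(doneA, stream0);

kernelB<<<gridB, blockB, 0, stream1>>>(/* ... */);
// Stream order already guarantees B is finished before C starts;
// the event wait covers the dependency on A.
cudaStreamWaitEvent(stream1, doneA, 0);
kernelC<<<gridC, blockC, 0, stream1>>>(/* ... */);
```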

Thanks, txbob!