I have three kernels, A, B, and C. A and B are independent and can be run in either order, or ideally concurrently.
Kernel C depends on the results of both A and B, so it must run only after both kernels have finished.
I can launch A and B on different streams and let the GPU schedule them itself, hopefully concurrently, assuming it has the resources for it.
But how do I efficiently schedule kernel C to begin after both A and B have finished? The obvious method is to launch A and B on two streams, have the CPU synchronize on stream A, then synchronize on stream B, and then launch kernel C from the CPU. This works.
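In code it looks roughly like this (simplified sketch; the kernel bodies, buffer sizes, and launch configurations here are just placeholders, not my real code):

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(float* outA) { /* independent work */ }
__global__ void kernelB(float* outB) { /* independent work */ }
__global__ void kernelC(const float* a, const float* b, float* outC) { /* needs both */ }

int main() {
    float *bufA, *bufB, *bufC;
    cudaMalloc(&bufA, 1024 * sizeof(float));
    cudaMalloc(&bufB, 1024 * sizeof(float));
    cudaMalloc(&bufC, 1024 * sizeof(float));

    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    kernelA<<<8, 128, 0, streamA>>>(bufA);   // A and B may overlap on the GPU
    kernelB<<<8, 128, 0, streamB>>>(bufB);

    cudaStreamSynchronize(streamA);          // CPU blocks here...
    cudaStreamSynchronize(streamB);          // ...and here
    kernelC<<<8, 128, 0, streamA>>>(bufA, bufB, bufC);  // launched only once the CPU wakes up

    cudaStreamSynchronize(streamA);
    cudaFree(bufA); cudaFree(bufB); cudaFree(bufC);
    cudaStreamDestroy(streamA); cudaStreamDestroy(streamB);
    return 0;
}
```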
But these are short kernels and my Jetson CPU is not fast, so there is noticeable CPU overhead and latency in that spin-wait, enough to really hurt throughput: the GPU sits idle about half the time while it waits for the (overloaded) CPU to juggle the scheduling. The latency is made worse by the fact that the CPU is doing its own compute in parallel.
Is there a way to have the GPU or GPU driver launch kernel C for me after both A and B are done?
It seems like a terrible hack, but I'm getting better performance by having A and B each write an "I'm done" flag to device memory when their last block finishes, and, if both flags are set, launching kernel C via dynamic parallelism from inside kernel A or B. It's an ugly hack: sensitive to race conditions, littered with memory fences, and confusing to code and maintain, but it does avoid some of the CPU latency. Even so, there is still a gap between kernel launches and I'm losing performance, so there must be SOME solution that is at least this efficient but isn't a hack.
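The hack is shaped roughly like this (heavily simplified; the counters, block counts, and kernel arguments are placeholders, my real code has more fences and error checking, and it needs relocatable device code, -rdc=true, for the device-side launch):

```cpp
__global__ void kernelC(const float* a, const float* b, float* outC);  // defined elsewhere

// Device-side bookkeeping, reset to zero before each A/B/C round.
__device__ unsigned int blocksDoneA = 0;
__device__ unsigned int blocksDoneB = 0;
__device__ int launchedC = 0;

// Called by one thread of every block at the very end of kernel A or B.
__device__ void finishAndMaybeLaunchC(unsigned int* myCounter, unsigned int myBlocks,
                                      unsigned int* otherCounter, unsigned int otherBlocks,
                                      const float* a, const float* b, float* outC)
{
    __threadfence();                                        // publish this block's results first
    unsigned int done = atomicAdd(myCounter, 1) + 1;
    if (done < myBlocks) return;                            // not the last block of my own kernel
    if (atomicAdd(otherCounter, 0) < otherBlocks) return;   // the other kernel isn't done yet
    if (atomicCAS(&launchedC, 0, 1) != 0) return;           // someone else already launched C
    kernelC<<<8, 128>>>(a, b, outC);                        // dynamic-parallelism launch of C
}

__global__ void kernelA(float* outA, const float* outB, float* outC)
{
    /* ... A's real work ... */
    __syncthreads();
    if (threadIdx.x == 0)
        finishAndMaybeLaunchC(&blocksDoneA, gridDim.x,
                              &blocksDoneB, /* B's block count */ 8,
                              outA, outB, outC);
}
```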
The other idea is to recode the three kernels into one monolithic kernel and use even more hacks to self-synchronize each dependent step of the computation. That seems like an even worse solution, though it might work.
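If it helps to see what I mean, the fused version would look something like this sketch using cooperative groups (assuming the device supports cooperative launch and the whole grid can be resident at once; names, sizes, and the work split are made up, and it may also need -rdc=true):

```cpp
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void fusedABC(float* bufA, float* bufB, float* bufC)
{
    cg::grid_group grid = cg::this_grid();

    /* ... do A's work on part of the grid and B's work on the rest ... */

    grid.sync();   // grid-wide barrier: A's and B's results are now visible everywhere

    /* ... do C's work using bufA and bufB ... */
}

int main() {
    float *bufA, *bufB, *bufC;
    cudaMalloc(&bufA, 1024 * sizeof(float));
    cudaMalloc(&bufB, 1024 * sizeof(float));
    cudaMalloc(&bufC, 1024 * sizeof(float));

    void* args[] = { &bufA, &bufB, &bufC };
    // Cooperative launch is required for grid.sync(); the grid must fit on the device at once.
    cudaLaunchCooperativeKernel((void*)fusedABC, dim3(8), dim3(128), args, 0, 0);
    cudaDeviceSynchronize();

    cudaFree(bufA); cudaFree(bufB); cudaFree(bufC);
    return 0;
}
```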
Thanks for any ideas, guys!