Warp divergence problem implicitly solved by launching multiple concurrent kernels?


I’m trying to program a collision detection system in CUDA in which I process a large number of primitive shapes together in one batch and check for collisions between them.

Because the shapes of the objects differ fundamentally, the collision test to be performed differs as well, and this leads to natural warp divergence: I have to check which kinds of shapes I’m processing before applying the specific test. In total there are 4 if-else cases covering the 4 combinations of primitive shapes.
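To make the situation concrete, here is a minimal sketch of the divergent kernel I described; the shape enum, the `Pair` layout, and the `test*()` helpers are simplified placeholders, not my actual collision code:

```cuda
#include <cuda_runtime.h>

enum ShapeType { SPHERE, BOX };

struct Pair {
    ShapeType a, b;
    // real geometry data omitted for brevity
};

// Placeholder narrow-phase tests (stand-ins for the real ones).
__device__ int testSphereSphere(const Pair&) { return 0; }
__device__ int testSphereBox(const Pair&)    { return 0; }
__device__ int testBoxSphere(const Pair&)    { return 0; }
__device__ int testBoxBox(const Pair&)       { return 0; }

__global__ void collide(const Pair* pairs, int* hit, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Pair p = pairs[i];
    // Adjacent threads in a warp can hold different shape combinations,
    // so the warp serialises over whichever branches its threads take.
    if      (p.a == SPHERE && p.b == SPHERE) hit[i] = testSphereSphere(p);
    else if (p.a == SPHERE && p.b == BOX)    hit[i] = testSphereBox(p);
    else if (p.a == BOX    && p.b == SPHERE) hit[i] = testBoxSphere(p);
    else                                     hit[i] = testBoxBox(p);
}
```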

I have a GPU that supports concurrent kernel execution, and I was wondering whether a trivial solution to the warp divergence problem would be to launch 4 separate kernels, one per shape combination. Does launching 4 separate kernels, and thereby avoiding the overarching branching conditions, guarantee that there will be no warp divergence? In other words, does the scheduler guarantee that blocks/warps belonging to one kernel are scheduled independently of those belonging to the others?
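This is roughly what I have in mind on the launch side: four specialised kernels, each containing only one collision test, launched on separate streams. The kernel names, pre-sorted pair arrays, and block size here are hypothetical, just to illustrate the idea:

```cuda
#include <cuda_runtime.h>

struct Pair;  // same pair layout as in the divergent version

// One kernel per shape combination; each body contains a single test,
// so there is no branch on shape type inside the kernel.
__global__ void collideSphereSphere(const Pair*, int*, int);
__global__ void collideSphereBox(const Pair*, int*, int);
__global__ void collideBoxSphere(const Pair*, int*, int);
__global__ void collideBoxBox(const Pair*, int*, int);

void launchAll(const Pair* pSS, const Pair* pSB,
               const Pair* pBS, const Pair* pBB,
               int* hSS, int* hSB, int* hBS, int* hBB,
               int nSS, int nSB, int nBS, int nBB)
{
    const int block = 256;
    cudaStream_t s[4];
    for (int k = 0; k < 4; ++k) cudaStreamCreate(&s[k]);

    // Launching on distinct streams allows the kernels to run
    // concurrently, hardware permitting.
    collideSphereSphere<<<(nSS + block - 1) / block, block, 0, s[0]>>>(pSS, hSS, nSS);
    collideSphereBox   <<<(nSB + block - 1) / block, block, 0, s[1]>>>(pSB, hSB, nSB);
    collideBoxSphere   <<<(nBS + block - 1) / block, block, 0, s[2]>>>(pBS, hBS, nBS);
    collideBoxBox      <<<(nBB + block - 1) / block, block, 0, s[3]>>>(pBB, hBB, nBB);

    for (int k = 0; k < 4; ++k) {
        cudaStreamSynchronize(s[k]);
        cudaStreamDestroy(s[k]);
    }
}
```

This assumes I have already partitioned the pairs by shape combination on the host (or in an earlier pass), which is the part I would like to confirm actually buys me divergence-free warps.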

I would be very grateful if someone familiar with the internals of the scheduler could help answer this. Thank you very much in advance for your time and help!

Yours sincerely,