Kernel with mixed requirements for grid count


I have a kernel that is split into two parts. The second part needs the results of the first for its calculation, but requires a grid four times the size of the first part's.

Currently I launch the kernel with the grid size required by the second part and skip the unused threads during the first part inside the global function. If I run two separate kernels instead, I have to store the intermediate results in global memory, which reduces the overall performance a lot.
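For reference, the masking approach could look roughly like this. This is a minimal sketch under the assumption that the dependency between the two parts is per-block (so the stage-1 results can live in shared memory); the kernel name, block size of 256, and the arithmetic are placeholders, not your actual code:

```cuda
// Sketch: one fused kernel, launched with the stage-2 grid size
// (256 threads per block assumed). Only a quarter of the threads
// do useful work in stage 1; all of them work in stage 2.
__global__ void fusedKernel(const float *in, float *out, int n2)
{
    __shared__ float stage1[64];               // 256 threads / 4
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    if (tid < 64)                              // mask: stage 1 uses 1/4 of the threads
        stage1[tid] = in[blockIdx.x * 64 + tid] * 2.0f;  // placeholder math
    __syncthreads();  // only synchronizes within a block, not across the grid,
                      // so this pattern requires the dependency to be per-block

    if (gid < n2)
        out[gid] = stage1[tid / 4] + 1.0f;     // each stage-1 value feeds 4 outputs
}
```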

Would it make sense to use the new CUDA Dynamic Parallelism for this problem, or is skipping the unused threads already a good approach?


With Dynamic Parallelism the child kernel cannot simply access the values computed by the parent unless you store them in global memory, so it would not save you that round trip.
You could process four elements per thread in the second stage and launch with the grid size of the first stage. You could also implement your kernel in grid-strided fashion. But I doubt any of this will have a significant impact on performance compared to just masking out the threads.
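A minimal sketch of that suggestion, assuming each stage-1 result feeds exactly four stage-2 outputs (the kernel name and the arithmetic are placeholders): launch with the stage-1 grid size and let each thread keep its stage-1 result in a register while producing four stage-2 elements. A grid-stride loop makes the launch size independent of the problem size.

```cuda
// Sketch: grid sized for stage 1; each thread computes one stage-1
// value, keeps it in a register, and writes four stage-2 outputs.
__global__ void twoStageKernel(const float *in, float *out, int n1)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;      // total threads in the grid

    for (int j = i; j < n1; j += stride) {    // grid-stride loop over stage-1 work
        float r = in[j] * 2.0f;               // stage 1 (placeholder math)
        for (int k = 0; k < 4; k++)           // stage 2: 4 outputs per input
            out[4 * j + k] = r + k;           // placeholder stage-2 work
    }
}
```

The point of this variant is that the stage-1 result never leaves the register, so the global-memory round trip you were worried about is avoided, just as in the masking approach.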

Got it, thanks a lot. I wasn't sure whether skipping threads was bad enough to justify looking for a different solution.