I have a kernel that is split into two parts. The second part depends on the results of the first for its calculation, but needs four times the grid size of the first.
Currently I launch the kernel with the grid size required by the second part and skip the unused threads in the first part inside the `__global__` function. If I run two separate kernels instead, I have to write the intermediate results to global memory, which hurts the overall performance a lot.
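For illustration, here is a minimal sketch of the pattern I mean (the names and the arithmetic are placeholders; the part-1 results stay in per-block shared memory so they never touch global memory, which is why I fuse the two parts):

```cuda
// Sketch only: part 1 uses a quarter of the threads per block,
// part 2 uses all of them and reads part 1's results.
__global__ void fusedKernel(const float *in, float *out)
{
    // Intermediate part-1 results live in shared memory,
    // avoiding the global-memory round trip of two kernels.
    extern __shared__ float tmp[];

    int tid = threadIdx.x;

    // Part 1: only the first quarter of the threads do useful
    // work here; the other three quarters are skipped.
    if (tid < blockDim.x / 4) {
        int gIdx = blockIdx.x * (blockDim.x / 4) + tid;
        tmp[tid] = in[gIdx] * 2.0f;        // placeholder for the real part-1 math
    }
    __syncthreads();                       // block-local sync is enough here

    // Part 2: all threads consume the part-1 results.
    int outIdx = blockIdx.x * blockDim.x + tid;
    out[outIdx] = tmp[tid / 4] + 1.0f;     // placeholder for the real part-2 math
}
```

Note that this only works because each part-2 thread reads part-1 results produced within its own block, so `__syncthreads()` suffices; there is no grid-wide synchronization between the two parts.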
Would it make sense to use the new CUDA Dynamic Parallelism for this problem, or is skipping the unused threads already a good approach?