I am writing some CUDA code for an application that first reads data from global memory, does some simple calculations, and produces some data A. Data A (which is small) is then used for further calculations that require many more memory reads. Since the two steps are very different and don't work well in a single kernel (the combined kernel would need more registers), I am thinking of splitting the kernel into two. But that would require the first kernel to write data A to global memory so that the second kernel can read it. With a single kernel, these extra writes and reads are not needed.
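To make the question concrete, here is a minimal sketch of the two-kernel split I have in mind. All names (`stage1`, `stage2`, `d_in`, `d_A`, `d_out`) and the actual math are placeholders, not my real code:

```cuda
// Stage 1: cheap computation that produces the small intermediate A.
__global__ void stage1(const float *in, float *A, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        A[i] = in[i] * 2.0f;   // placeholder for the real calculation
}

// Stage 2: memory-heavy computation that consumes A.
__global__ void stage2(const float *A, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = A[i] + 1.0f;  // placeholder for the real calculation
}

// Host side: A has to round-trip through global memory (d_A) between
// the two launches, which is exactly the extra traffic I want to avoid.
//   stage1<<<grid, block>>>(d_in, d_A, n);
//   stage2<<<grid, block>>>(d_A, d_out, n);
```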
I am wondering whether this application is suitable for dynamic parallelism (DP). I read the guide on DP but could not find the relevant implementation details. When the parent launches the child, does the parent still reside on the SM and occupy resources? If yes, I don't see how DP would give any performance improvement. If not, does it incur context-switch overhead?
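For reference, this is roughly the DP variant I am asking about (a sketch under the same placeholder names as above; DP requires compiling with `-rdc=true` and linking `cudadevrt`):

```cuda
// Memory-heavy second stage, launched from the device.
__global__ void stage2(const float *A, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = A[i] + 1.0f;  // placeholder for the real calculation
}

// Parent: computes A, then one thread launches the child over all of A.
__global__ void parent(const float *in, float *A, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        A[i] = in[i] * 2.0f;   // placeholder for the real calculation

    // Device-side launch of the second stage. The child is only
    // guaranteed to have finished by the time the parent grid exits,
    // so the parent cannot consume the child's results here.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        stage2<<<(n + 255) / 256, 256>>>(A, out, n);
}
```

My question is about what the hardware does between the device-side launch and the parent grid's exit in a pattern like this.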