The difference between “stack frame” (i.e. local memory) usage and global (device) memory usage should not really have much impact on performance unless the data is accessed more than once. If the data is accessed more than once, then move it up the memory hierarchy to achieve higher performance. The method of doing this is no different when using CUDA Dynamic Parallelism (CDP) kernels: most likely the data passed to the kernel resides in global memory. From there, you could move it “up the hierarchy”, for example by staging data that sees significant re-use into shared memory.
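As a hypothetical sketch of that staging pattern (the kernel and array names here are illustrative, not from the original discussion), each block can copy its working set from global memory into shared memory once, then service all repeated reads from shared memory:

```cuda
// Illustrative 1D 3-point smoothing kernel: each input element is read by
// up to three threads, so the block stages its tile (plus a one-element
// halo on each side) into shared memory once. Assumes blockDim.x == 256.
__global__ void smoothKernel(const float *in, float *out, int n)
{
    __shared__ float tile[256 + 2];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // One global read per thread, plus halo loads by the edge threads.
    tile[threadIdx.x + 1] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[blockDim.x + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    // Repeated accesses now hit fast on-chip shared memory rather than
    // going back to global memory.
    if (gid < n)
        out[gid] = (tile[threadIdx.x] + tile[threadIdx.x + 1] +
                    tile[threadIdx.x + 2]) / 3.0f;
}
```

Whether this pays off depends entirely on the degree of re-use; for data touched only once, the staging step adds cost without benefit.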
That said, Dynamic Parallelism is not a universal method for increasing code performance. Its benefits lie in code reuse, reduced host/device synchronization traffic, and, in general, allowing the programmer to express algorithms in a more natural way. It may also allow certain types of algorithm realizations to more fully utilize the GPU.
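For reference, the basic mechanism looks like this (a minimal sketch with illustrative kernel names; CDP requires compiling with relocatable device code, e.g. `nvcc -rdc=true`, on a supported architecture):

```cuda
__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

__global__ void parentKernel(float *data, int n)
{
    // The decision about how much further work to launch is made on the
    // device itself -- no round trip to the host is needed, which is the
    // source of the reduced host/device synchronization traffic.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(data, n);
    }
}
```

The point is expressiveness: data-dependent amounts of work can be launched where the data is, rather than being sized and launched from host code.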
So a few of the above benefits (reduced host/device synchronization traffic, fuller GPU utilization for certain algorithm realizations), if they are significant in your application, may allow code exploiting CDP to run faster than corresponding code that does not. But CDP is not, first and foremost, a general technique for increasing code performance. If your code can be expressed as kernels that efficiently and fully utilize the GPU, it is not likely that an alternate, CDP-based realization will yield any significant performance benefit.