How to obtain the best performance from Dynamic Parallelism

Dear All

I am programming a K40 (cc 3.5). Some time ago I tried to use dynamic parallelism. I declared several arrays in the parent kernel, which places them in the stack frame (local memory, which resides in main video memory). That way, however, I could not pass those arrays by reference to child kernels. I then found that if I instead allocate the arrays in video memory up front (from the CPU), I get lower performance than without dynamic parallelism. Even without dynamic parallelism, allocating the arrays in video memory up front instead of in the stack frame also gives lower performance.

What is the best way to proceed?

Thanks

Luis Gonçalves

The difference between “stack frame” (i.e. local memory) usage and “video memory (by the CPU)” (i.e. global memory) usage should not have much impact on performance unless the data is accessed more than once. If the data is accessed more than once, move it up the memory hierarchy to achieve higher performance. The method of doing this is no different when using CUDA Dynamic Parallelism (CDP) kernels: most likely the data passed to the kernel is passed in global memory. From there, you can move it “up the hierarchy”, for example by copying data that sees significant re-use into shared memory.
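As a rough illustration of moving re-used data “up the hierarchy”, here is a minimal sketch in which each block copies a small table from global memory into shared memory once and then reads it many times from there. The kernel name `applyTable`, the helper `runApplyTable`, the table size and the arithmetic are all hypothetical and chosen only to keep the example short; they are not taken from the original poster's code.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kTableSize = 32;  // hypothetical size of the re-used table

// Each block stages the table from global into shared memory once;
// every thread then reads all kTableSize entries from shared memory.
__global__ void applyTable(const float *table, const float *in, float *out, int n)
{
    __shared__ float sTable[kTableSize];
    for (int k = threadIdx.x; k < kTableSize; k += blockDim.x)
        sTable[k] = table[k];
    __syncthreads();  // the whole table must be loaded before anyone reads it

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < kTableSize; ++k)
            acc += sTable[k] * in[i];
        out[i] = acc;
    }
}

// Host helper (hypothetical): returns the number of wrong results,
// or -1 if a CUDA call fails.
int runApplyTable()
{
    const int n = 1000;
    std::vector<float> hTable(kTableSize, 1.0f), hIn(n), hOut(n, 0.0f);
    for (int i = 0; i < n; ++i) hIn[i] = static_cast<float>(i);

    float *dTable = nullptr, *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dTable, kTableSize * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&dIn, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(dTable, hTable.data(), kTableSize * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    applyTable<<<blocks, threads>>>(dTable, dIn, dOut, n);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    // With an all-ones table, every output should equal kTableSize * in[i].
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != kTableSize * hIn[i]) ++mismatches;

    cudaFree(dTable); cudaFree(dIn); cudaFree(dOut);
    return mismatches;
}
```

Calling `runApplyTable()` from `main` should return 0 on a working device. Whether staging through shared memory actually helps depends on how often the data is re-used; with a single access per element it would only add overhead.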

Having stated that, Dynamic Parallelism is not a universal method for increasing code performance. Dynamic Parallelism has benefits in terms of code reuse, reducing host/device synchronization traffic, and in general allowing the programmer to express algorithms in a more natural way. It may also have the benefit of allowing certain types of algorithm realizations to more fully utilize the GPU.

A few of the above benefits (reduced host/device synchronization traffic, and fuller GPU utilization for certain types of algorithm realizations) may, if they are significant in your application, allow code that exploits CDP to run faster than corresponding code that does not. But CDP is not first and foremost a general technique for increasing code performance. If your code can already be expressed as kernels that efficiently and fully utilize the GPU, an alternate, CDP-based realization is unlikely to yield any significant performance benefit.
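For reference, the pattern discussed in this thread (arrays kept in global memory so that they can be handed to child kernels) might look roughly like the sketch below. The names `parentLaunch`, `childScale` and `runParentChild`, and the sizes used, are hypothetical. It assumes a device of compute capability 3.5 or higher and compilation with relocatable device code, e.g. `nvcc -arch=sm_35 -rdc=true example.cu -lcudadevrt`.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Child kernel: scales one segment of a buffer that lives in global memory.
__global__ void childScale(float *segment, int len, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) segment[i] *= factor;
}

// Parent kernel: one thread per segment launches a child grid.
__global__ void parentLaunch(float *scratch, int numSegments, int segLen)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < numSegments) {
        // A pointer to a local array (e.g. "float local[64];") must not be
        // passed to a child kernel; a pointer into global memory is fine.
        float *segment = scratch + s * segLen;
        const int threads = 128;
        const int blocks = (segLen + threads - 1) / threads;
        childScale<<<blocks, threads>>>(segment, segLen, static_cast<float>(s + 1));
    }
}

// Host helper (hypothetical): returns the number of wrong results,
// or -1 if a CUDA call fails.
int runParentChild()
{
    const int numSegments = 4, segLen = 256, n = numSegments * segLen;
    std::vector<float> h(n, 1.0f);

    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    parentLaunch<<<1, numSegments>>>(d, numSegments, segLen);
    // The parent grid is not complete until all of its child grids are,
    // so a host-side synchronize is enough here.
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Segment s started as all ones and was scaled by (s + 1).
    int mismatches = 0;
    for (int s = 0; s < numSegments; ++s)
        for (int i = 0; i < segLen; ++i)
            if (h[s * segLen + i] != static_cast<float>(s + 1)) ++mismatches;

    cudaFree(d);
    return mismatches;
}
```

This only shows the mechanics of passing global memory to child grids; it says nothing about speed. For work this small and regular, a single flat kernel over all `n` elements would very likely be faster than launching child grids.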