How to obtain the best performance of Dynamic Parallelism

luisgo · April 23, 2015, 2:11pm

Dear All

I am programming a K40 (cc 3.5). Some time ago I tried to use dynamic parallelism. I declared several arrays in the parent kernel, which go to the stack frame (in main video memory). That way, however, I could not pass those arrays by reference to the child kernels. I then found that if I allocate those arrays up front in video memory (by the CPU), I get lower performance than without dynamic parallelism. If I allocate the arrays up front in video memory instead of in the stack frame, and do not use dynamic parallelism, I also get lower performance.

What is the best way to proceed?

Thanks

Luis Gonçalves

Robert_Crovella · April 23, 2015, 2:50pm

The difference between “stack frame” i.e. local memory usage and “video memory (by the CPU)” i.e. global memory usage should not really have much impact on performance unless the data is being accessed more than once. If the data is being accessed more than once, then move the data up the memory hierarchy to achieve higher performance. The method of doing this isn’t any different when using CDP (CUDA Dynamic Parallelism) kernels: most likely the data passed to the kernel is passed in global memory. From there, you could move it “up the hierarchy”, for example by moving data that sees significant re-use into shared memory.
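As a rough illustration of the above (a minimal sketch, not code from the original poster's application): the parent kernel hands the child pointers to global memory allocated by the host, and the child stages each tile into shared memory before re-reading it many times. The kernel names `parent` and `child`, the helper `run_example`, the tile size `TILE`, and the summation workload are all hypothetical and chosen only for the example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 256  // hypothetical tile size; also used as the child block size

// Child kernel: each block copies one tile from global memory into shared
// memory once, then every thread re-reads the whole tile from shared memory.
__global__ void child(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    float sum = 0.0f;
    for (int i = 0; i < TILE; ++i)
        sum += tile[i];            // re-use is served from shared memory
    if (gid < n)
        out[gid] = sum;
}

// Parent kernel: forwards global-memory pointers to the child grid.
// A pointer to an array declared locally in this kernel could not be
// passed here, because local memory is private to the parent thread.
__global__ void parent(const float *in, float *out, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int blocks = (n + TILE - 1) / TILE;
        child<<<blocks, TILE>>>(in, out, n);
    }
}

// Host helper: returns the number of wrong results, or -1 on a CUDA error.
int run_example()
{
    const int n = 1024;            // multiple of TILE, so every tile is full
    std::vector<float> h_in(n, 1.0f), h_out(n, 0.0f);
    float *d_in = nullptr, *d_out = nullptr;

    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    parent<<<1, 1>>>(d_in, d_out, n);
    // The parent grid is not complete until its child grid has finished.
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != (float)TILE) ++errors;  // each tile holds TILE ones
    return errors;
}

int main()
{
    int errors = run_example();
    printf("mismatches: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

Code that launches kernels from the device needs relocatable device code and the device runtime library, for example `nvcc -arch=sm_35 -rdc=true example.cu -lcudadevrt` (adjust `-arch` to the GPU in use). Whether the shared-memory staging pays off depends on how much re-use the real kernel has; with a single access per element it would likely make no difference.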

Having stated that, Dynamic Parallelism is not a universal method for increasing code performance. Dynamic Parallelism has benefits in terms of code reuse, reducing host/device synchronization traffic, and in general allowing the programmer to express algorithms in a more natural way. It may also have the benefit of allowing certain types of algorithm realizations to more fully utilize the GPU.

So a few of the above benefits (reducing host/device synchronization traffic, allowing certain types of algorithm realizations to more fully utilize the GPU), if they are significant in your application, may allow a code exploiting CDP to run faster than a corresponding code that does not. But CDP is not first and foremost a general technique for increasing code performance. If your code can be expressed in kernels that efficiently and fully utilize the GPU, it’s not likely that an alternate, CDP-based realization will yield any significant performance benefits.
