Dynamic Parallelism improvement

Hi everyone,

To find out which kinds of computation actually benefit from Dynamic Parallelism (DP), I ran some tests on my Tesla K20:

  • Starting from the NVIDIA sample "cdpLUDecomposition", I implemented a "classical" version of the decomposition without DP (I just replaced the parent kernel with a C function and added some memory transfers). My implementation is relatively ugly but it works, and curiously it runs faster than the original sample (142 GFLOPS vs 82 GFLOPS on an 8192*8912 matrix)! OK, maybe the "cdpLUDecomposition" sample is just a demonstration of how to use DP rather than of its potential.
  • I tried DP on a homemade application that processes hundreds of matrices. The processing is identical and independent for each matrix. So I wrote a main (parent) kernel in which each single-thread block processes one matrix and launches child kernels. I also wrote another version in which the main (parent) kernel has only one single-thread block and the matrices are distributed along one dimension of the child kernels' grid. In both cases, the "classical" version of my application is faster…
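
The two launch layouts from the second test can be sketched roughly as follows. This is only an illustration under assumptions, not the code of the application: the kernel names, the element-wise scaling that stands in for the real matrix processing, and the 256-thread child blocks are all hypothetical. It assumes a device of compute capability 3.5 or higher and a build with relocatable device code (`-rdc=true`, linked against `cudadevrt`).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-matrix work: scale each element of one n*n matrix.
__global__ void childOneMatrix(float *m, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n * n) m[i] *= s;
}

// Hypothetical batched child: blockIdx.y selects which matrix to work on.
__global__ void childAllMatrices(float *mats, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float *m = mats + (size_t)blockIdx.y * n * n;
    if (i < n * n) m[i] *= s;
}

// Layout 1: one single-thread parent block per matrix, each launching a child grid.
__global__ void parentPerMatrix(float *mats, int n, float s)
{
    float *m = mats + (size_t)blockIdx.x * n * n;
    childOneMatrix<<<(n * n + 255) / 256, 256>>>(m, n, s);
}

// Layout 2: a single one-thread parent block; the matrices are spread over
// the y dimension of one child grid.
__global__ void parentSingleBlock(float *mats, int n, int count, float s)
{
    dim3 grid((n * n + 255) / 256, count);
    childAllMatrices<<<grid, 256>>>(mats, n, s);
}

// Runs both layouts on matrices filled with 1.0f (scale by 2, then by 3).
// Returns the number of elements that differ from 6.0f, or -1 on a CUDA error.
int runBothLayouts(int count, int n)
{
    size_t total = (size_t)count * n * n;
    std::vector<float> host(total, 1.0f);
    float *dev = nullptr;
    if (cudaMalloc(&dev, total * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(dev, host.data(), total * sizeof(float), cudaMemcpyHostToDevice);

    // A parent grid is not complete until its child grids have finished, so
    // the second parent only starts after all children of the first are done.
    parentPerMatrix<<<count, 1>>>(dev, n, 2.0f);
    parentSingleBlock<<<1, 1>>>(dev, n, count, 3.0f);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(host.data(), dev, total * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (float v : host)
        if (v != 6.0f) ++bad;
    return bad;
}

int main()
{
    int bad = runBothLayouts(200, 32);
    if (bad == 0)
        printf("both layouts produced the expected values\n");
    else
        printf("mismatch or CUDA error (%d)\n", bad);
    return bad == 0 ? 0 : 1;
}
```

In layout 1 every parent block pays for its own device-side launch, while layout 2 issues a single launch for all matrices. Neither layout does anything that a plain host-side launch of the batched child kernel could not do.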

I am not defeatist, but I would like to know whether anyone has an example where Dynamic Parallelism brings a real gain.
What are the characteristics of a use case that suits it?

Thanks in advance,
Guix

Please take a look at this discussion:

http://stackoverflow.com/questions/14855408/bottlneck-of-dynamic-programming-in-cuda-global-memory-allocations-to-exchange

For an interpolation problem, I finally managed to improve my results using dynamic parallelism.

Hi, thanks for your answer, I will take a look!

Guix