To find out in which compute cases Dynamic Parallelism (DP) is worthwhile, I ran some tests on my Tesla K20:
- Starting from the NVIDIA sample "cdpLUDecomposition", I implemented a "classical" version (without DP) of the decomposition: I just replaced the parent kernel with a C function and added some memory transfers. My implementation is rather ugly, but it works, and curiously it runs faster than the original sample (142 GFLOPS vs 82 GFLOPS on an 8192*8912 matrix)! OK, maybe the "cdpLUDecomposition" sample is only a demonstration of how to use DP, not of its potential.
- I tried DP in a homemade application that processes hundreds of matrices. The processing is identical and independent for each matrix. So I wrote a main (parent) kernel in which each single-threaded block processes one matrix and launches child kernels. I also wrote another version in which the main (parent) kernel has only one single-threaded block and the matrices are dispatched along one dimension of the child kernels. In both cases, the "classical" version of my application is faster…
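
To make the two DP layouts from the second test concrete, here is a minimal sketch. It is not my actual application: the kernel names, the 256-thread block size, the matrix sizes, and the placeholder "multiply by 2" work are all assumptions chosen for illustration. It assumes a device of compute capability 3.5 or higher and a build along the lines of `nvcc -arch=sm_35 -rdc=true sketch.cu -lcudadevrt`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-matrix work: scales one n*n matrix in place.
__global__ void childPerMatrix(float *m, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n * n) m[i] *= 2.0f;
}

// Layout 1: one single-threaded parent block per matrix,
// each block launching its own child grid.
__global__ void parentPerMatrix(float *data, int n)
{
    float *m = data + (size_t)blockIdx.x * n * n;
    childPerMatrix<<<(n * n + 255) / 256, 256>>>(m, n);
}

// Child for layout 2: blockIdx.y selects the matrix.
__global__ void childBatched(float *data, int n)
{
    float *m = data + (size_t)blockIdx.y * n * n;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n * n) m[i] *= 2.0f;
}

// Layout 2: a single single-threaded parent block that launches
// one child grid covering all matrices along the grid's y dimension.
__global__ void parentSingle(float *data, int n, int numMatrices)
{
    dim3 grid((n * n + 255) / 256, numMatrices);
    childBatched<<<grid, 256>>>(data, n);
}

int main()
{
    const int n = 64, numMatrices = 200;           // illustrative sizes
    const size_t count = (size_t)numMatrices * n * n;
    std::vector<float> h(count, 1.0f);

    float *d = nullptr;
    cudaMalloc(&d, count * sizeof(float));
    cudaMemcpy(d, h.data(), count * sizeof(float), cudaMemcpyHostToDevice);

    parentPerMatrix<<<numMatrices, 1>>>(d, n);     // layout 1
    parentSingle<<<1, 1>>>(d, n, numMatrices);     // layout 2

    cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(h.data(), d, count * sizeof(float), cudaMemcpyDeviceToHost);
    printf("status: %s, first element: %f\n", cudaGetErrorString(err), h[0]);

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

In both layouts the parent does nothing except launch children, which is the same work a host loop (or a single batched host launch) does in the "classical" version.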
I am not defeatist, but I would like to know whether anyone has an example where Dynamic Parallelism brings a real gain.
What are the characteristics of a suitable use case?
Thanks in advance,