CUDA Dynamic Parallelism Performance

Hi Guys,

I am a newbie to CUDA. I am currently doing a performance comparison with dynamic parallelism.
I have three kernels, and I compared their performance when launched from the host against launching them from the device (dynamic parallelism).

The dynamic parallelism parent kernel dimensions are:
grid - (1, 1, 1)
block - (1, 1, 1)
Each child kernel's dimensions are detailed below. Each launch happens after the previous kernel has completed (not recursively). I have set cudaLimitDevRuntimeSyncDepth to 2 and cudaLimitDevRuntimePendingLaunchCount to 1024 * 128.

The host kernel launch dimensions are the same as the child kernel dimensions.
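For readers who want to reproduce the comparison, here is a minimal sketch of the setup described above. It is not the original code: mapKernel, parentKernel, and timeLaunch are hypothetical names, only the map step is shown, and the use of CUDA events for timing is an assumption, since the post does not say how the times were measured. The child is launched into the default device-side stream, so additional children launched by the same thread would run one after another without an explicit device-side synchronization.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the "map" kernel (the real kernels are not shown in the post).
__global__ void mapKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Parent kernel run with one block of one thread; it only launches the child.
// Launches issued by the same thread into the same stream execute in order,
// so a reduce and a sort child could be appended after this launch.
__global__ void parentKernel(float *data, int n) {
    mapKernel<<<1024, 1024>>>(data, n);
}

// Times one launch path with CUDA events (assumed timing method).
// The parent grid is not complete until its children finish, so the stop
// event also covers the child kernel's execution.
static float timeLaunch(bool fromDevice, float *d_data, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    if (fromDevice) parentKernel<<<1, 1>>>(d_data, n);
    else            mapKernel<<<1024, 1024>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1024 * 1024;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Limits mentioned in the post. The sync-depth limit is only supported on
    // older devices and toolkits, so its return value is not treated as fatal.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 2);
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 1024 * 128);

    // First call of each path is a warm-up; the second is reported.
    timeLaunch(false, d_data, n);
    float hostMs = timeLaunch(false, d_data, n);
    timeLaunch(true, d_data, n);
    float devMs = timeLaunch(true, d_data, n);

    printf("host launch:   %.1f us\n", hostMs * 1000.0f);
    printf("device launch: %.1f us\n", devMs * 1000.0f);

    cudaFree(d_data);
    return 0;
}
```

Dynamic parallelism needs relocatable device code and the device runtime library, so on a GTX 980 (compute capability 5.2) this would be built with something like: nvcc -arch=sm_52 -rdc=true test.cu -lcudadevrt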

The following are my kernel dimensions and the execution times for host launching and device launching.

| Kernel   | Calculation type | Grid dimension | Block dimension | Host launch | Device launch |
|----------|------------------|----------------|-----------------|-------------|---------------|
| Kernel 1 | Map operation    | 1024           | 1024            | 52.7 us     | 119.2 us      |
| Kernel 2 | Reduce operation | 1024           | 1024            | 183.7 us    | 334.9 us      |
| Kernel 3 | Sort operation   | 1              | 512             | 221.7 us    | 383.3 us      |

I found some more details here: [url]http://users.ece.gatech.edu/~sudha/academic/class/ece8823/Lectures/Module-6-Microarchitecture/cuda-dyn-par.pdf[/url]

The presentation explains that dynamic parallelism has some synchronization overhead, and it says the kernel execution time itself should be the same.
But I observed that the kernel execution time with dynamic parallelism is higher than with host launching.

I am not sure whether these results are expected. Or am I doing something wrong?

Test Environment
GPU - GeForce GTX 980
OS - Red Hat Enterprise Linux Server release 6.6 (Linux k7-1 2.6.32-504.el6.x86_64)
CPU - Intel(R) Core™ i7-4770 CPU @ 3.40GHz
The timestamps were taken after the second iteration of the run.

Thank you in advance.

Vishwa

You can use the code tag (the last one in the toolbox above the edit box) to format your table nicely.

Cross-posted (where the formatting is better):

[url]http://stackoverflow.com/questions/38343526/cuda-dynamic-parallelism-performance[/url]

I will use the mentioned feature next time. Thank you :). Actually, I went to Stack Overflow because of that formatting issue.

I have updated the question there with more details.