Would dynamic parallelism improve performance for a routine that runs an unknown number of times? Here is an example:
Without Dynamic Parallelism
int main(){
    // init...
    bool flag = true;
    bool *deviceFlag;   // device-side flag, cleared by work() when done
    // ...allocate data and deviceFlag on the device...
    while(flag){
        work<<<x,y>>>(data, deviceFlag);
        cudaMemcpy(&flag, deviceFlag, sizeof(bool), cudaMemcpyDeviceToHost);
    }
}
With Dynamic Parallelism
int main(){
    // init...
    dynamicKernel<<<1,1>>>(...);
    cudaDeviceSynchronize();   // wait for the whole nested sequence to finish
}

__global__ void dynamicKernel(...){
    while(*deviceFlag){
        work<<<x,y>>>(data, deviceFlag);
        cudaDeviceSynchronize();   // device-side sync: wait for the child grid
    }
}
My hope was to eliminate the per-iteration CPU-GPU round trip and improve performance. The cudaDeviceSynchronize() after the child kernel seems to be preventing any gain. Also, I'm guessing that launching dynamicKernel<<<1,1>>> is naive. Is this too simple an application of dynamic parallelism for an actual performance gain? Am I missing something?
Thanks in advance.
You might try using tail recursion rather than a loop to avoid the cudaDeviceSynchronize(). I don't know whether this will speed anything up, since there is still a global barrier between work and the next iteration.
For example:
int main(){
    // init...
    dynamicKernel<<<1,1>>>(...);
    cudaDeviceSynchronize();   // wait for the recursive chain to finish
}

__global__ void dynamicKernel(...){
    if (*deviceFlag) {
        work<<<x,y>>>(data, deviceFlag);
        // Same-stream ordering serializes work and the next level, so no
        // device-side cudaDeviceSynchronize() is needed here.
        dynamicKernel<<<1,1>>>(...);
    }
}
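One caveat with the recursive version (a sketch based on the device runtime's documented limits; check the Programming Guide for your toolkit version, and the 4096 count is just an illustrative guess): each nested launch consumes a slot in the device runtime's pending-launch buffer, and a long recursive chain can overflow the default. You can raise the limit from the host before the first launch:

// Host side, before launching dynamicKernel: enlarge the pending-launch
// buffer so a deep recursive chain does not exhaust the device runtime's
// default. Size the count to your expected iteration count.
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

If the limit is exceeded, device-side launches start failing, which would silently cut the recursion short.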
Thanks for the input Greg; unfortunately, as you suspected, there was no speedup.
It would probably be a good idea to make sure that the launch latency is actually the performance bottleneck.
Can you run a contrived example where you launch exactly the right number of iterations at once and see if you get a performance gain?
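Something like the following could serve as that contrived baseline (a sketch: work, data, x, y, and the iteration count N stand in for your actual kernel and sizes). Launch a fixed number of iterations back-to-back with no flag copy, and time them with CUDA events to see how much of the runtime the launch/transfer overhead accounts for:

// Contrived baseline: pretend we know the loop runs exactly N times, launch
// all iterations back-to-back with no device-to-host flag copy, and time them.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int i = 0; i < N; ++i)
    work<<<x,y>>>(data, deviceFlag);   // no cudaMemcpy between launches
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// Compare ms against the flag-polling loop: if the two are close, launch and
// transfer latency was never the bottleneck.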
I cannot unroll the iterations, but what I did try was removing almost all of the work: the kernel simply sets an array to zeros, followed by the sync/memcpy. Performance was similar, with the nested-kernel version a little slower. I'm guessing this means the device-side barrier is more costly than a barrier driven from the CPU (even though the CPU version does an extra transfer over the PCIe bus!)