Would dynamic parallelism improve performance for a routine that runs an unknown number of times? Here is an example:
Without Dynamic Parallelism
int main(){
    // init...
    bool flag = true;
    bool *deviceFlag;   // device-side flag, cleared by work() when done
    // ...allocate data and deviceFlag on the device...
    while(flag){
        work<<<x,y>>>(data, deviceFlag);
        cudaMemcpy(&flag, deviceFlag, sizeof(bool), cudaMemcpyDeviceToHost);
    }
}
With Dynamic Parallelism
int main(){
    // init...
    dynamicKernel<<<1,1>>>(...);
    cudaDeviceSynchronize();   // wait for the whole nested sequence to finish
}

__global__ void dynamicKernel(...){
    while(*deviceFlag){
        work<<<x,y>>>(data, deviceFlag);
        cudaDeviceSynchronize();   // device-side sync: wait for the child grid
    }
}
My hope was to eliminate the per-iteration CPU-GPU round trip and improve performance. The cudaDeviceSynchronize() after the child kernel seems to be preventing any gain. Also, I'm guessing that launching dynamicKernel<<<1,1>>> is naive. Is this too simple an application of dynamic parallelism for an actual performance gain? Am I missing something?
Thanks in advance.
You might try using tail recursion rather than a loop to avoid the cudaDeviceSynchronize(). I don't know whether this will speed anything up, since there is still a global barrier between work and the next iteration.
For example:
int main(){
    // init...
    dynamicKernel<<<1,1>>>(...);
    cudaDeviceSynchronize();   // wait for the recursive chain to finish
}

__global__ void dynamicKernel(...){
    if (*deviceFlag) {
        work<<<x,y>>>(data, deviceFlag);
        // Same-stream ordering serializes work and the next level, so no
        // device-side cudaDeviceSynchronize() is needed here.
        dynamicKernel<<<1,1>>>(...);
    }
}
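One caveat with the recursive version (a sketch based on the device runtime's documented limits; check the Programming Guide for your toolkit version, and the 4096 count is just an illustrative guess): each nested launch consumes a slot in the device runtime's pending-launch buffer, and a long recursive chain can overflow the default. You can raise the limit from the host before the first launch:

// Host side, before launching dynamicKernel: enlarge the pending-launch
// buffer so a deep recursive chain does not exhaust the device runtime's
// default. Size the count to your expected iteration count.
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

If the limit is exceeded, device-side launches start failing, which would silently cut the recursion short.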
Thanks for the input Greg; unfortunately, as you suspected, there was no speedup.
It would probably be a good idea to make sure that the launch latency is actually the performance bottleneck.
Can you run a contrived example where you launch exactly the right number of iterations at once and see if you get a performance gain?
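Something like the following could serve as that contrived baseline (a sketch: work, data, x, y, and the iteration count N stand in for your actual kernel and sizes). Launch a fixed number of iterations back-to-back with no flag copy, and time them with CUDA events to see how much of the runtime the launch/transfer overhead accounts for:

// Contrived baseline: pretend we know the loop runs exactly N times, launch
// all iterations back-to-back with no device-to-host flag copy, and time them.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int i = 0; i < N; ++i)
    work<<<x,y>>>(data, deviceFlag);   // no cudaMemcpy between launches
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// Compare ms against the flag-polling loop: if the two are close, launch and
// transfer latency was never the bottleneck.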
I cannot unroll the iterations, but what I did try was removing almost all of the work: the kernel simply sets an array to zeros, followed by the sync/memcpy. Performance was similar, with the nested-kernel version a little slower. I'm guessing this means the device-side barrier is more costly than a barrier driven from the CPU (even though the CPU version does an extra transfer over the PCIe bus!)