Dynamic Parallelism and FFT


Does anyone have a device-callable FFT library? Dynamic parallelism has been out for over six months now, so I had hoped there would be some news on whether cuFFT is going to be ported to a device-callable version (potentially with some extra limitations). So far, every time I have asked I have been met with silence, or with a quick follow-up question followed by silence. Only the actual FFT execution would need to be device-callable; the plan creation can be done by the host well before execution is needed.
I would have thought NVIDIA would be keen to promote dynamic parallelism, as it plays to the strengths of the newer architectures, yet I cannot find any information at all about when (or even whether) they will provide device-callable versions of their computation libraries. This is a shame, because the strength of the libraries is one of the biggest benefits of using CUDA over OpenCL.


There is a recent thread in which this point is discussed

See also the comment by talonmies.

In dynamic parallelism, the number of threads that you can launch from a thread is limited by the overall number of threads you can launch on a given device. This may be a limitation when developing FFT algorithms (not only cuFFT) that exploit dynamic parallelism.

I’m not from NVIDIA and this is a personal opinion. Do not take it as an absolute truth. The best, of course, would be to have an answer from NVIDIA itself.

Thanks for the response, and yes, it would be best to get an answer directly from NVIDIA on whether it is feasible and on their development roadmap. Unfortunately, the last response I had from them on this was months ago and it did not really contain any helpful information.

If they are limited by how much they can instantiate to cope with every possible combination (the worst case being each thread needing to launch >=1 FFT), then knowing that they are developing it but hitting snags would be better than silence.
My use case would only require one thread (a ‘master thread’) to call an FFT, as my workflow is basically:

2D FFT -> 2D Transform -> 2D FFT -> 2D Transform -> 2D FFT -> 2D Transform -> 2D FFT

so the 0th thread could schedule each FFT once the preceding transform had completed. My main reason for wanting this quite badly is the serious limitation of WDDM, which is causing kernel launches to be the limiting factor in execution.
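For reference, the host-driven version of this workflow can be sketched roughly as below. This is only a minimal sketch under stated assumptions, not anyone's actual code: `transform2d` is a hypothetical placeholder for the 2D transform step (here it just scales each element), and `run_pipeline` is a made-up name. The plan is created once up front and every stage is queued on a single stream, so the host does not synchronize between stages. Each stage still costs host-side launches, which is the overhead under discussion.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical stand-in for the "2D Transform" step: scales every element.
__global__ void transform2d(cufftComplex* data, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i].x *= scale;
        data[i].y *= scale;
    }
}

// Queues FFT -> transform -> FFT -> ... -> FFT (num_ffts FFTs in total)
// in place on one stream. Returns 0 on success, non-zero on error.
int run_pipeline(cufftComplex* d_data, int nx, int ny, int num_ffts, float scale)
{
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return 1;

    // Plan creation happens once, on the host, before any execution.
    cufftHandle plan;
    if (cufftPlan2d(&plan, nx, ny, CUFFT_C2C) != CUFFT_SUCCESS) {
        cudaStreamDestroy(stream);
        return 2;
    }
    cufftSetStream(plan, stream);

    const int n = nx * ny;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    int status = 0;
    for (int s = 0; s < num_ffts && status == 0; ++s) {
        if (cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD) != CUFFT_SUCCESS)
            status = 3;
        // A transform follows every FFT except the last one.
        if (status == 0 && s + 1 < num_ffts)
            transform2d<<<blocks, threads, 0, stream>>>(d_data, n, scale);
    }

    // Single synchronization point at the end of the whole chain.
    if (status == 0 && cudaStreamSynchronize(stream) != cudaSuccess) status = 4;

    cufftDestroy(plan);
    cudaStreamDestroy(stream);
    return status;
}
```

Calling `run_pipeline(d_data, nx, ny, 4, scale)` on a device buffer of `nx * ny` complex values reproduces the four-FFT, three-transform chain above; the file needs to be built with `nvcc` and linked against cuFFT (`-lcufft`).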


If you need only one thread to perform the FFT, what are the other threads doing in the meantime? Just waiting? If so, why do you need dynamic parallelism?

Furthermore, in dynamic parallelism you are launching kernels from kernels, so I think you will have kernel launch overhead anyway.


I may have misunderstood a strength of dynamic parallelism then. I had believed that since the kernels were being launched from the device, the overhead associated with WDDM would no longer exist, and the time to set up all of the parameters etc. would be significantly reduced because no host/device copying or syncing is required.

My fundamental problem is that my average kernel execution time is less than the time it takes for the host to prep the kernel (I am assuming this is almost entirely down to WDDM), so at times my GPU is sat idle waiting for kernels to execute. This extra overhead alone equates to something like a 30% increase in runtime.

Changing away from Win7 is not currently an option, so I was hoping Dynamic Parallelism would at least shrink down the number of kernel calls initiated by the host. In my mind the removal of the 3 kernel overheads per FFT would shrink it down to the point where the GPU was no longer idle.

Obviously, if the launch overhead of a kernel from the device is >= the launch overhead (including WDDM crapness) from the host then this is a lesson in futility, but I have not found any performance specs for a dynamic parallelism launch.
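In the absence of published figures, one rough way to get a number on a particular machine is a small microbenchmark along the following lines. This is a sketch under stated assumptions rather than a rigorous measurement: the kernel and function names are made up for illustration, the child kernel is empty, and the launch count is kept small (well under the default pending-launch limit of the device runtime). It times a single ‘master thread’ launching empty child kernels from the device and, for comparison, the same number of empty launches from the host.

```cuda
#include <cuda_runtime.h>

// Empty kernel: its cost is essentially pure launch overhead.
__global__ void empty_child() {}

// One "master thread" launches `count` empty child grids back to back.
__global__ void launch_children(int count)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int i = 0; i < count; ++i)
            empty_child<<<1, 1>>>();
    }
}

// Times the work queued by `from_device ? parent kernel : host loop`.
// Returns average milliseconds per launch, or a negative value on error.
static float time_launches(int count, bool from_device)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up so one-time initialization is not included in the timing.
    launch_children<<<1, 1>>>(1);
    empty_child<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    if (from_device) {
        // The parent grid only completes once all of its children have.
        launch_children<<<1, 1>>>(count);
    } else {
        for (int i = 0; i < count; ++i)
            empty_child<<<1, 1>>>();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaError_t err = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (err == cudaSuccess) ? ms / count : -1.0f;
}

float device_launch_ms(int count) { return time_launches(count, true); }
float host_launch_ms(int count)   { return time_launches(count, false); }
```

Comparing `device_launch_ms(1000)` with `host_launch_ms(1000)` gives a first indication of whether device-side launches are cheaper on a given setup. Dynamic parallelism requires a device of compute capability 3.5 or higher and relocatable device code, e.g. `nvcc -arch=sm_35 -rdc=true bench.cu -lcudadevrt`.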